I’ve been having a small problem on one of my servers with the http daemon dying every week or two. It’s not often enough to be a huge problem or to invest a lot of time in, but it’s enough of a nuisance to need a fix. So what I ended up doing was installing monit to look after things.
monit is a simple daemon that checks on server resources (mainly services and daemons, but also disk space and load) every few minutes and sends an alert and/or restarts the service if there is a problem. So after installing the package (apt-get install monit) I just created a series of rules like:
check process exim4 with pidfile /var/run/exim4/exim.pid
    start program = "/etc/init.d/exim4 start"
    stop program = "/etc/init.d/exim4 stop"
    if failed host 127.0.0.1 port 25 protocol smtp then alert
    if 5 restarts within 5 cycles then timeout

check process popa3d with pidfile /var/run/popa3d.pid
    start program = "/etc/init.d/popa3d start"
    stop program = "/etc/init.d/popa3d stop"
    if failed port 110 protocol pop then restart
    if 5 restarts within 5 cycles then timeout
for the main processes on the machine. Sample rules are available in the config file and the documentation, and searching Google is fairly safe as long as you make sure you don’t copy a 10th-generation rule off a Ruby on Rails site (RoR components apparently require frequent restarts). All up, the whole install and configuration took me around half an hour, and I’m now monitoring:
# monit summary
System 'crimson.usenet.net.nz'    running
Process 'lighttpd'                running
Process 'sshd'                    running
Process 'named'                   running
Process 'exim4'                   running
Process 'popa3d'                  running
Process 'mysql'                   running
Process 'mailman'                 running
Device 'rootfs'                   accessible
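Since the http daemon was the original reason for all this, here is a rough sketch of what a lighttpd rule might look like if it follows the same pattern as the exim4 and popa3d rules, together with the disk-space and load checks mentioned earlier. This is not a copy of a working config: the pidfile path /var/run/lighttpd.pid, port 80 and the thresholds (90% disk usage, a 5-minute load average of 4) are assumptions and placeholders, so check them against your own system. Recent monit documentation also spells the device check as "check filesystem" and the last restart action as "unmonitor" rather than "timeout", so use whichever your installed version expects.

```
# Hypothetical lighttpd rule, modelled on the exim4/popa3d rules.
# The pidfile path and port are assumptions for a stock Debian install.
check process lighttpd with pidfile /var/run/lighttpd.pid
    start program = "/etc/init.d/lighttpd start"
    stop program = "/etc/init.d/lighttpd stop"
    if failed host 127.0.0.1 port 80 protocol http then restart
    if 5 restarts within 5 cycles then timeout

# Disk and load checks. The thresholds are placeholders, not recommendations.
check device rootfs with path /
    if space usage > 90% then alert

check system crimson.usenet.net.nz
    if loadavg (5min) > 4 then alert
```

After editing the control file, running "monit -t" checks its syntax before you reload monit with "monit reload".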