Server recovery first steps

At work last week we were looking at backups on a group of machines that had been installed by another company but which our team had recently taken over. I was interested in the backup system they had, which involved taking an LVM snapshot of the boot partition and then rsyncing this to another machine in the group (the rsyncs went around in a circle, more or less).

This looked quite cute for quick machine recoveries (we kickstart our servers but we are still at the stage of doing a fair bit of post-install setup), and we had a think about recovering machines by doing a simple kickstart, then netbooting the server, mounting the root partition under the netboot and rsyncing the backup over the fresh install. This seemed a promising idea which we thought would only take an hour or so per machine.

However, over the weekend I had a bit of a think and it popped into my head that Mondorescue almost did this sort of thing out of the box already. So I’ve been playing around with it a bit this week.

So what I have now ( testing using a scratch VM ) are a few commands that:

  1. Back up the server to an NFS partition.
  2. Make a differential backup of everything changed since that full backup.
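As a rough sketch of what those two commands look like, the Python below just builds the mondoarchive argument lists and prints them. The flags (-O, -D, -n, -d, -p) are as I recall them from the mondoarchive man page, so check them against your installed version. The NFS export, subdirectory and prefixes are made-up placeholders, and the helper function is purely illustrative.

```python
import shlex


def mondoarchive_command(nfs_export, subdir, prefix, differential=False):
    """Build a mondoarchive argument list for a backup to an NFS share.

    Flag meanings are assumptions taken from the man page as remembered:
      -O  back up the filesystem
      -D  differential: only files changed since the last full backup
      -n  NFS export that will hold the ISO images
      -d  directory within that export
      -p  prefix for the ISO file names
    """
    args = ["mondoarchive", "-O"]
    if differential:
        args.append("-D")
    args += ["-n", nfs_export, "-d", subdir, "-p", prefix]
    return args


# Hypothetical NFS export, directory and prefixes.
full = mondoarchive_command("backuphost:/export/mondo", "scratchvm", "scratchvm-full")
diff = mondoarchive_command("backuphost:/export/mondo", "scratchvm", "scratchvm-diff",
                            differential=True)

for command in (full, diff):
    # Print rather than run, so nothing is touched until the flags are verified.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Once the flags have been confirmed, the same lists could be handed to subprocess.run() from a cron job.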

Which means I now have a directory on an NFS server with a couple of bootable ISOs sitting in it. One has the full backup of the machine (it’s about a third of the size of the used space) and the other has any changes made since the first was done. I do the differential because the full backup takes about 30 minutes of hard work for the server while the differential only takes 3 minutes or so (YMMV). I’ll probably do full backups every week and differential backups nightly.
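To put numbers on that schedule, here is a small sketch using the timings from my test VM (30 minutes for a full backup, 3 for a differential). Picking Sunday for the full backup is an arbitrary assumption, and the function names are just for illustration.

```python
import datetime

FULL_MINUTES = 30          # measured on the test VM
DIFFERENTIAL_MINUTES = 3   # measured on the test VM
FULL_BACKUP_WEEKDAY = 6    # Sunday (Monday is 0); an arbitrary choice


def backup_kind(day):
    """Return which kind of backup to run on the given date."""
    return "full" if day.weekday() == FULL_BACKUP_WEEKDAY else "differential"


def weekly_minutes(fulls_per_week):
    """Backup time per week if the remaining nights are differentials."""
    differentials = 7 - fulls_per_week
    return fulls_per_week * FULL_MINUTES + differentials * DIFFERENTIAL_MINUTES


print(backup_kind(datetime.date(2024, 1, 7)))  # a Sunday -> full
print(backup_kind(datetime.date(2024, 1, 8)))  # a Monday -> differential
print(weekly_minutes(1))  # one full plus six differentials -> 48
print(weekly_minutes(7))  # a full backup every night -> 210
```

So the weekly-full, nightly-differential plan costs the server about 48 minutes of backup work a week instead of 210.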

The fun bit is the recovery:

  1. Connect to the server's remote console and boot it over the network
  2. Use PXE to boot the full backup mondorescue image
  3. Mondo boots and then automatically restores the server to the state it was in when the full backup was made (about 15 minutes). I then have to hit enter a couple of times to reboot
  4. Netboot the differential mondo image.
  5. Mondo now applies any changes between the last full and the last differential backup.
  6. Reboot again to the hard drive
  7. Finished, machine should be up and running.
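The order of images in the steps above can be sketched as a small selection function: restore the newest full backup, then the newest differential taken after it. Because a differential holds everything changed since the last full backup, only the latest one is needed. The (timestamp, kind) tuples are a hypothetical way of describing the ISOs on the NFS server, not something Mondo produces.

```python
from datetime import datetime


def restore_chain(backups):
    """Pick the images to restore, in order.

    backups is a list of (taken_at, kind) tuples, where kind is
    "full" or "differential". Returns the newest full backup followed
    by the newest differential taken after it, if there is one.
    """
    fulls = [b for b in backups if b[1] == "full"]
    if not fulls:
        raise ValueError("no full backup available")
    latest_full = max(fulls, key=lambda b: b[0])

    newer_diffs = [
        b for b in backups
        if b[1] == "differential" and b[0] > latest_full[0]
    ]
    chain = [latest_full]
    if newer_diffs:
        chain.append(max(newer_diffs, key=lambda b: b[0]))
    return chain


# Made-up example: one weekly full followed by two nightly differentials.
backups = [
    (datetime(2024, 1, 7, 1, 0), "full"),
    (datetime(2024, 1, 8, 1, 0), "differential"),
    (datetime(2024, 1, 9, 1, 0), "differential"),
]
for taken_at, kind in restore_chain(backups):
    print(taken_at.isoformat(), kind)
```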

A bit of testing shows this only takes about 20 minutes for my test VM (3 gigabytes of default RHEL 5 goodness) and production servers shouldn’t be much slower (more data but faster disks and CPUs).

With a bit of luck I should have this ready to deploy in a few days (although I’m a little short of NFS space to apply it to every machine).

Overall a fun couple of days. Depending on how it goes I might even do a lightning talk about it at the Sysadmin Miniconf next month, although I’m not sure if it’s a little trivial, since this is close to “out of the box” functionality for Mondorescue.