Server recovery first steps
At work last week we were looking at backups on a group of machines that had been installed by another company but which our team had recently taken over. I was interested in the backup system they had, which involved taking an LVM snapshot of the boot partition and then rsyncing this to another machine in the group (the rsyncs went around in a circle, more or less).
This looked quite cute for quick machine recoveries (we kickstart our servers but we are still at the stage of doing a fair bit of post-install setup) and we had a think about recovering machines by doing a simple kickstart, then netbooting the server, mounting the root partition under the netboot and rsyncing the backup over the fresh install. This seemed a promising idea which we thought would only take an hour or so per machine.
However, over the weekend I had a bit of a think and it popped into my head that Mondorescue almost did this sort of thing out of the box already. So I’ve been playing around a bit with it this week.
So what I have now (tested on a scratch VM) are a few commands that:
- Back up the server to an NFS partition.
- Make a differential backup of the changes since the previous backup.
Which means I now have a directory on an NFS server with a couple of bootable ISOs sitting in it. One has the full backup of the machine (it’s about a third of the size of the used space) and the other has any changes made since the first was done. I do the differential since the full backup takes about 30 minutes of hard work for the server while the differential only takes 3 minutes or so (YMMV). I’ll probably do full backups every week and differential backups nightly.
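As a rough illustration of that weekly-full, nightly-differential schedule, here is a minimal Python sketch that only builds the command line and does not run anything. The flags shown (-O to back up, -n for an NFS mount, -d for the directory on it, -D for a differential) are what I understand mondoarchive to accept, so check them against the man page for your version. The function names, the NFS host and the paths are made up for the example, and a real run will almost certainly want extra options that are not shown here.

```python
import datetime


def mondoarchive_command(nfs_mount, subdir, differential=False):
    """Build (but do not run) a mondoarchive argument list.

    nfs_mount and subdir are placeholders; the flags are assumptions
    to verify against your mondoarchive man page.
    """
    cmd = ["mondoarchive", "-O", "-n", nfs_mount, "-d", subdir]
    if differential:
        # Only archive files changed since the last full backup.
        cmd.append("-D")
    return cmd


def command_for_date(day, nfs_mount, subdir, full_weekday=6):
    """Full backup on one weekday (default Sunday), differential otherwise."""
    differential = day.weekday() != full_weekday
    return mondoarchive_command(nfs_mount, subdir, differential)


if __name__ == "__main__":
    today = datetime.date.today()
    # Hypothetical NFS server and directory names.
    print(" ".join(command_for_date(today, "nfs.example.com:/backups", "/scratchvm")))
```

Dropped into a nightly cron job, something like this would pick the right kind of backup each night without needing two separate crontab entries.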
The fun bit is the recovery:
- Remote console into the server and boot it over the network.
- Use PXE to boot the full-backup Mondorescue image.
- Mondo boots and then automatically restores the server to the state it was in when the last full backup was made (about 15 minutes). I then have to hit enter a couple of times to reboot.
- Netboot the differential Mondo image.
- Mondo now applies any changes between the last full and the last differential backup.
- Reboot again to the hard drive
- Finished, machine should be up and running.
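The order of the restore steps above follows from how differential backups work: each one holds everything changed since the last full backup, so only the newest full image and the newest differential taken after it are needed. A small Python sketch of that selection logic is below. The Backup class and restore_order function are invented for illustration and are not part of Mondorescue.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Backup:
    taken: date
    kind: str  # "full" or "differential"


def restore_order(backups):
    """Return the images to boot, in order: newest full, then newest later differential."""
    fulls = [b for b in backups if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup to restore from")
    last_full = max(fulls, key=lambda b: b.taken)

    # Differentials older than the last full backup are irrelevant.
    later_diffs = [
        b for b in backups
        if b.kind == "differential" and b.taken > last_full.taken
    ]
    order = [last_full]
    if later_diffs:
        order.append(max(later_diffs, key=lambda b: b.taken))
    return order
```

With true incremental backups the list would instead have to include every image taken since the full backup, which is one reason the differential approach keeps the recovery down to two netboots.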
A bit of testing shows this only takes about 20 minutes for my test VM (3 gigabytes of default RHEL 5 goodness) and production servers shouldn’t be much slower (more data but faster disks and CPUs).
With a bit of luck I should have this ready to deploy in a few days (although I’m a little short of NFS space to apply it to every machine).
Overall a fun couple of days. Depending on how it goes, I might even do a lightning talk about it at the Sysadmin Miniconf next month, although I’m not sure whether it’s a little too trivial, since this is close to “out of the box” functionality for Mondorescue.