Linux.conf.au 2014 – Day 3 – Session 3

Continuous Integration for your database migrations by Michael Still

  • Running unit and integration tests on all patches
  • Terminology
    • SQLAlchemy – the database ORM that OpenStack Nova uses
    • Schema version: a single database schema, represented by a number
    • Database migration: the process of moving between schema versions
  • Motivation
    • Tests weren’t exercising upgrades against real, large production datasets
    • We found the following things
      • Schema drift – some deployments had schemas that weren’t possible to upgrade because they didn’t match current tools
      • Performance issues – Some upgrades took too long
      • Broken downgrades – didn’t work for non-trivial downgrades
    • Are downgrades important?
  • Turbo-hipster is a test runner
    • A series of test plugins
    • Register with Zuul
    • Runs task plugins when requested, returns results
  • Task Plugin
    • Python Plugin
  • The DB upgrade plugin
    • Upgrade to trunk
    • Upgrade to the patch
    • Downgrade to the 1st migration in the release
    • Upgrade again
    • Pass / fail based on analysis of the logs from the shell script
  • Let’s go mad with plugins
    • Email people when code they worked on is changed by others
    • Cause doc bugs to be created
    • Cause dependent patches to be created when a patch requires changes to “flow down” to other repos
    • Much already in Gerrit but does it need to be?
  • OMG Security
    • This is a bit scary
    • We’re running code on our workers provided by 3rd parties
    • Mitigation
      • Limited access to nodes
      • Untrusted code tested with network turned off
      • Checks logs for suspicious data
      • We’re working on dataset anonymisation
  • Running a process with networking turned off
    • Explored LXC (containers)
    • netns is much simpler
  • Interesting Bugs
    • Slow upgrade -> the dev iterated on his code several times, running it against the test until it was fast enough
  • Would be happy to do this with Postgres if Postgres community wants to help get it going
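
The four steps of the DB upgrade plugin and its log-based pass / fail check could be sketched roughly as below. This is only an illustration and not turbo-hipster's actual code: the function names, the log line format and the time limit are all assumptions made up for the example.

```python
import re

# Hypothetical log line format; the real plugin's logs may look different.
MIGRATION_LINE = re.compile(r"migration (\d+) -> (\d+) took (\d+(?:\.\d+)?)s")


def plan_steps(trunk_version, patch_version, first_in_release):
    """Return the ordered (label, target schema version) steps from the talk."""
    return [
        ("upgrade to trunk", trunk_version),
        ("upgrade to the patch", patch_version),
        ("downgrade to the 1st migration in the release", first_in_release),
        ("upgrade again", patch_version),
    ]


def analyse_log(log_text, max_seconds):
    """Pass only if no error lines appear and no migration exceeds max_seconds."""
    problems = []
    for line in log_text.splitlines():
        if "ERROR" in line or "Traceback" in line:
            problems.append("error: " + line.strip())
            continue
        match = MIGRATION_LINE.search(line)
        if match and float(match.group(3)) > max_seconds:
            problems.append("slow: %s -> %s took %ss" % match.groups())
    return (not problems, problems)


if __name__ == "__main__":
    for label, target in plan_steps(216, 217, 133):
        print(label, "->", target)
    sample = "migration 215 -> 216 took 3.2s\nmigration 216 -> 217 took 120s\n"
    print(analyse_log(sample, max_seconds=60))
```

Checking the duration of each migration as well as errors is what would catch the "upgrade took too long" class of problem listed under Motivation.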

 

Live upgrading many thousands of servers from an ancient RedHat 7.1 to a 10 year newer Debian based one by Marc Merlin

  • Longer version http://marc.merlins.org/linux/talks/ProdNG-LCA2014/
  • Google started with a Linux CD (in our case Red Hat 6.2)
  • Then kickstart
  • Updates used ssh loops to connect to machines and upgrade them
  • Any push based method is doomed
  • Running from cron will break eventually
  • Across thousands of machines a percentage will fail and have to be fixed by hand
  • File Level syncing
    • Makes all your servers the same
    • Exclude a few files (resolv.conf, syslog)
    • rsync doesn’t scale well, but they have rsync-like software that does something similar
  • All servers are the same
    • For the root partition, yes
    • Per-machine software lives outside the root partition
    • Static links for libraries
    • Hundreds of different apps with their own dependencies
  • How to upgrade root partition
    • Just security upgrades mainly
    • Running Red Hat 7.1 for a long time
  • How to upgrade base packages
    • upgrade packages, create and test new master image, slowly push to live
    • only two images in prod, current and the old one
  • How about pre/post installs?
    • removed most of them
    • sync daemon has a watch on some files and does something when that file changed
  • How did running 7.1 work out?
    • It works a long time but not forever
    • Very scary
    • Oh, and preferably don’t reboot the machines if at all possible
  • What new distribution?
    • Workstations had already moved to Debian from Red Hat
    • Debian has more packages
    • Ubuntu is better than Debian, so started with Ubuntu Dapper
  • Init System choice
    • Boot time not a big deciding factor
    • Consistent Boot order very useful
    • systemd a lot of work to convert, upstart a lot too
    • systemd option for future
  • ProdNG
    • self hosting
    • Entirely rebuilt from source
    • Remove unneeded dependencies
    • end distribution 150MB (without google custom bits)
    • No complicated upstart, dbus, plymouth
    • Small is quicker to sync
    • Newer packages not always better, sometimes old is good, new stuff has lots of extra stuff you might not need
  • How to push it
    • 20k+ files changed
    • How to convince people it will work, how to test?
    • The push is hard to do slowly, as you have to maintain 2 very different systems in prod
  • Turned into many smaller jumps
    • Turn Debian packages into rpms and install them on existing servers one at a time
  • Cruft Removal
    • Get rid of junk, like X fonts, X server, random locales
    • Old libs nothing is using
    • No C++ left so libstdc++ removed
  • One at a time
    • Upgrade libc from 2.2.2 to 2.3.6
    • Upgrade small packages and work up
    • 150 packages upgraded a few at a time. Took just over 2 years
  • Convert rpms to debs
    • Same packages on both images
    • Had to convert internal packages from rpms to debs
    • Used alien and a custom script to convert
    • Changelogs have a more fixed format in debs than rpms
  • Switch the live base packages over to debs
    • Only one major bug
  • Lessons learned
    • If you maintain a lot of machines, having your own fork lets you remove all the bits you don’t need
    • Force server users to use an API you provide and not to write to the root FS
    • File level sync recovers from any state and is more reliable than most other methods
    • You can do crazy things like distribution switches
    • Don’t blindly install upstream updates
    • If you don’t need it remove it
    • You probably don’t want to run the latest thing, more trouble than it is worth
    • Smaller jumps are easier
    • The best way to do a huge upgrade is a few packages at a time
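
To make the file-level sync idea more concrete, here is a minimal sketch of how a sync pass might decide what to do on one machine. It is not Google's tool: the manifest layout (path mapped to a content digest), the function names and the trigger table are assumptions for illustration only. It covers the three behaviours mentioned in the notes: make the root partition match the master image, leave a few excluded files alone, and run an action when a watched file changes.

```python
def plan_sync(master, local, excluded=frozenset(), triggers=None):
    """Work out how to make `local` match `master` outside the excluded paths.

    master, local: dict mapping file path -> content digest (hypothetical layout)
    excluded: paths that stay per-machine, e.g. resolv.conf
    triggers: dict mapping watched path -> action to run when that file changes
    """
    triggers = triggers or {}
    # Copy anything that is missing locally or whose content differs.
    to_copy = sorted(path for path, digest in master.items()
                     if path not in excluded and local.get(path) != digest)
    # Delete anything the master image does not contain.
    to_delete = sorted(path for path in local
                       if path not in excluded and path not in master)
    # Watched files that changed stand in for package post-install scripts.
    to_run = [triggers[path] for path in to_copy if path in triggers]
    return to_copy, to_delete, to_run


def apply_plan(master, local, to_copy, to_delete):
    """Return the local state after the plan has been carried out."""
    result = dict(local)
    for path in to_copy:
        result[path] = master[path]
    for path in to_delete:
        del result[path]
    return result


if __name__ == "__main__":
    master = {"/bin/ls": "a1", "/etc/resolv.conf": "m0", "/etc/inetd.conf": "new"}
    local = {"/bin/ls": "a1", "/etc/resolv.conf": "site", "/etc/inetd.conf": "old",
             "/usr/lib/stale.so": "x9"}
    print(plan_sync(master, local, {"/etc/resolv.conf"},
                    {"/etc/inetd.conf": "reload inetd"}))
```

Because the plan is computed from the current state rather than from a history of past pushes, running it again after it has been applied yields nothing to do, which is one way to read the claim that file-level sync recovers from any state.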