Linux.conf.au 2016 – Sysadmin Miniconf – Session 2

Site Reliability Engineering at Dropbox – Tammy Butow

Have an SLA and measure against it. Cap ops work, hold blameless post-mortems, always be coding
400M customers, a billion files every day
Very hard to find enough people to scale, so build tools to scale instead
Team looks after 6,000 DB machines and is responsible for the whole machine, not just the app
Build a lot of tools in Python and Go
PygerDuty – Python library for the PagerDuty API
- Easy to find the top things paging, write tools to reduce these
- Regular weekly meeting to review those problems and make them better
- If work is happening on machines then turn off monitoring on them so people don’t get woken up for things they don’t need to act on.
- Going for days without any pages
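
The first and third points above (find the top things paging, and don't page for machines under maintenance) can be sketched in a few lines of Python. This is only an illustration, not the tooling described in the talk: the page records, the maintenance windows and the function names are all made up, and it does not call PygerDuty or the PagerDuty API.

```python
from collections import Counter
from datetime import datetime

# Hypothetical page records; real ones would come from the paging system.
PAGES = [
    {"host": "db-101", "alert": "replication lag", "at": datetime(2016, 2, 1, 3, 0)},
    {"host": "db-101", "alert": "disk full", "at": datetime(2016, 2, 1, 4, 0)},
    {"host": "db-102", "alert": "replication lag", "at": datetime(2016, 2, 1, 3, 30)},
    {"host": "db-103", "alert": "replication lag", "at": datetime(2016, 2, 1, 5, 0)},
    {"host": "db-102", "alert": "disk full", "at": datetime(2016, 2, 1, 7, 0)},
    {"host": "db-104", "alert": "replication lag", "at": datetime(2016, 2, 1, 9, 0)},
    {"host": "db-104", "alert": "high load", "at": datetime(2016, 2, 1, 9, 5)},
]

# Hypothetical maintenance windows: host -> (start, end).
MAINTENANCE = {
    "db-102": (datetime(2016, 2, 1, 2, 0), datetime(2016, 2, 1, 6, 0)),
}


def in_maintenance(page, windows):
    """Return True if the page fired while its host was under maintenance."""
    window = windows.get(page["host"])
    if window is None:
        return False
    start, end = window
    return start <= page["at"] < end


def top_pagers(pages, windows, n=3):
    """Count alerts by type, ignoring pages raised during maintenance."""
    actionable = [p for p in pages if not in_maintenance(p, windows)]
    return Counter(p["alert"] for p in actionable).most_common(n)


if __name__ == "__main__":
    for alert, count in top_pagers(PAGES, MAINTENANCE):
        print(f"{count:3d}  {alert}")
```

A weekly review could start from a list like this and pick the noisiest alert to automate away first.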
Self-Healing and auto-remediation scripts
Hermes
- Allocate and track tasks to groups
Automation of DB tasks
Bot copies PagerDuty alerts into Slack
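
As a rough idea of what such a bot does, the sketch below turns an alert into the JSON body that a Slack incoming webhook accepts (a JSON object with a "text" field). The alert fields (status, service, summary) and the function name are assumptions for illustration, not PagerDuty's actual webhook schema or the bot from the talk, and nothing is sent over the network.

```python
import json


def slack_payload(alert):
    """Build the JSON body for a Slack incoming webhook from an alert dict.

    The keys used here (status, service, summary) are hypothetical.
    """
    text = "[{status}] {service}: {summary}".format(**alert)
    return json.dumps({"text": text})


if __name__ == "__main__":
    example = {
        "status": "triggered",
        "service": "db-101",
        "summary": "replication lag",
    }
    # POSTing this string to a webhook URL is left out of the sketch.
    print(slack_payload(example))
```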
Aim Higher
- Create a roadmap for next 12 months
- Building a rocketship while it is flying through the sky
Follow the Sun so people are working days
Post Mortem for every page
Frequent DR testing
Take time out to celebrate

I missed out writing up the next couple of talks due to technical problems