Site Reliability Engineering at Dropbox – Tammy Butow
- Having an SLA and measuring against it. Caps on ops work, blameless post-mortems, always be coding
- 400M customers, a billion files every day
- Very hard to find enough people to scale, so build tools to scale instead
- Team looks after 6,000 DB machines, and looks after the whole machine, not just the app
- Build a lot of tools in Python and Go
- PygerDuty – Python library for the PagerDuty API
- Easy to find the top things paging, write tools to reduce these
- Regular weekly meeting to review those problems and make them better
- If work is happening on machines, turn off monitoring on them so people don’t get woken up unnecessarily.
- Going for days without any pages
- Self-Healing and auto-remediation scripts
- Hermes
- Allocate and track tasks to groups
- Automation of DB tasks
- Bot copies PagerDuty alerts into Slack
- Aim Higher
- Create a roadmap for next 12 months
- Building a rocket ship while it is flying through the sky
- Follow the sun so people are working during their daytime
- Post-mortem for every page
- Frequent DR testing
- Take time out to celebrate
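
The "find the top things paging" idea from the notes can be sketched in a few lines of Python. This is only an illustration, not Dropbox's tooling: the `top_paging_alerts` function, the record fields, and the sample data are all hypothetical, and real records would come from the paging system rather than a hard-coded list.

```python
from collections import Counter


def top_paging_alerts(incidents, n=3):
    """Return the n most frequent alert titles with their page counts."""
    counts = Counter(incident["title"] for incident in incidents)
    return counts.most_common(n)


# Hypothetical sample data standing in for a week of pages.
sample_incidents = [
    {"title": "replication lag", "host": "db-001"},
    {"title": "disk full", "host": "db-002"},
    {"title": "replication lag", "host": "db-003"},
    {"title": "high load", "host": "db-001"},
    {"title": "replication lag", "host": "db-004"},
    {"title": "disk full", "host": "db-005"},
]

for title, count in top_paging_alerts(sample_incidents, n=2):
    print(f"{count:3d}  {title}")
```

A ranked list like this gives the weekly review meeting a concrete starting point: the alert at the top is the first candidate for a tool or an auto-remediation script.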
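
Turning off monitoring while work is happening on a machine could look roughly like the sketch below. It assumes a simple in-memory registry of maintenance windows; the `MaintenanceWindows` class, its method names, and the host names are made up for illustration and are not from the talk.

```python
from datetime import datetime, timedelta


class MaintenanceWindows:
    """Hypothetical registry of hosts that are currently being worked on."""

    def __init__(self):
        self._windows = {}  # host -> (begin, end)

    def start(self, host, begin, duration):
        """Record that work on `host` runs from `begin` for `duration`."""
        self._windows[host] = (begin, begin + duration)

    def should_page(self, host, at):
        """Page unless `at` falls inside the host's maintenance window."""
        window = self._windows.get(host)
        if window is None:
            return True
        begin, end = window
        return not (begin <= at < end)


windows = MaintenanceWindows()
work_begins = datetime(2016, 1, 1, 2, 0)
windows.start("db-001", work_begins, timedelta(hours=1))

# An alert during the window is suppressed; one after it pages as usual.
during = windows.should_page("db-001", work_begins + timedelta(minutes=30))
after = windows.should_page("db-001", work_begins + timedelta(minutes=90))
```

Giving every window an explicit end time means monitoring comes back on by itself, so a forgotten "re-enable alerts" step cannot leave a machine silently unmonitored.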
I missed out on writing up the next couple of talks due to technical problems.