Marcus Bristol (Pushpay) – Moving fast without crashing
- Low tolerance for errors in production due to being in finance
- Deploy twice per day
- Just Culture – Balance safety and accountability
- What rule?
- Who did it?
- How bad was the breach?
- Who gets to decide?
- Example of Retributive Culture
- KPIs reflect incidents.
- If more than 10% deploys bad then affect bonus
- Reduced number of deploys
- Restorative Culture
- Blameless post-mortem
- Can give detailed account of what happened without fear or retribution
- Happens after every incident or near-incident
- Written Down in Wiki Page
- So everybody has the chance to have a say
- Summary, Timeline, impact assessment, discussion, Mitigations
- Mitigations become highest-priority work items
- Our Process
- Feature Flags
- Science
- Lots of small PRs
- Code Review
- Testers paired to devs so bugs can be fixed as soon as found
- Automated tested
- Pollination (reviews of code between teams)
- Bots
- Posts to Slack when feature flag has been changed
- Nags about feature flags that seems to be hanging around in QA
- Nags about Flags that have been good in prod for 30+ days
- Every merge
- PRs awaiting reviews for long time (days)
- Missing postmortun migrations
- Status of builds in build farm
- When deploy has been made
- Health of API
- Answer queries on team member list
- Create ship train of PRs into a build and user can tell bot to deploy to each environment