DevOps Days Auckland 2017 – Wednesday Session 2

Marcus Bristol (Pushpay) – Moving fast without crashing

  • Low tolerance for errors in production due to being in finance
  • Deploy twice per day
  • Just Culture – Balance safety and accountability
    • What rule?
    • Who did it?
    • How bad was the breach?
    • Who gets to decide?
  • Example of Retributive Culture
    • KPIs reflect incidents.
    • If more than 10% deploys bad then affect bonus
    • Reduced number of deploys
  • Restorative Culture
  • Blameless post-mortem
    • Can give detailed account of what happened without fear or retribution
    • Happens after every incident or near-incident
    • Written Down in Wiki Page
    • So everybody has the chance to have a say
    • Summary, Timeline, impact assessment, discussion, Mitigations
    • Mitigations become highest-priority work items
  • Our Process
    • Feature Flags
    • Science
    • Lots of small PRs
    • Code Review
    • Testers paired to devs so bugs can be fixed as soon as found
    • Automated tested
    • Pollination (reviews of code between teams)
    • Bots
      • Posts to Slack when feature flag has been changed
      • Nags about feature flags that seems to be hanging around in QA
      • Nags about Flags that have been good in prod for 30+ days
      • Every merge
      • PRs awaiting reviews for long time (days)
      • Missing postmortun migrations
      • Status of builds in build farm
      • When deploy has been made
      • Health of API
      • Answer queries on team member list
      • Create ship train of PRs into a build and user can tell bot to deploy to each environment
Share