DevOps Days Auckland 2017 – Tuesday Session 2

Using Bots to Scale incident Management – Anthony Angell (Xero)

  • Who we are
    • Single Team
    • Just a platform Operations team
  • SRE team is formed
    • Ops teams plus performance Engineering team
  • Incident Management
    • In Bad old days – 600 people on a single chat channel
    • Created Framework
    • what do incidents look like, post mortems, best practices,
    • How to make incident management easy for others?
  • ChatOps (Based on Hubot)
    • Automated tour guide
    • Multiple integrations – anything with Rest API
    • Reducing time to restore
    • Flexability
  • Release register – API hook to when changes are made
  • Issue report form
    • Summary
    • URL
    • User-ids
    • how many users & location
    • when started
    • anyone working on it already
    • Anything else to add.
  • Chat Bot for incident
    • Populates for an pushes to production channel, creates pagerduty alert
    • Creates new slack channel for incident
    • Can automatically update status page from chat and page senior managers
    • Can Create “status updates” which record things (eg “restarted server”), or “yammer updates” which get pushed to social media team
    • Creates a task list automaticly for the incident
    • Page people from within chat
    • At the end: Gives time incident lasted, archives channel
    • Post Mortum
  • More integrations
    • Report card
    • Change tracking
    • Incident / Alert portal
  • High Availability – dockerisation
  • Caching
    • Pageduty
    • AWS
    • Datadog

 

Share