Using Bots to Scale Incident Management – Anthony Angell (Xero)
- Who we are
- Single Team
- Just a platform Operations team
- SRE team is formed
- Ops teams plus Performance Engineering team
- Incident Management
- In the bad old days – 600 people on a single chat channel
- Created Framework
- What incidents look like, post mortems, best practices
- How to make incident management easy for others?
- ChatOps (Based on Hubot)
- Automated tour guide
- Multiple integrations – anything with a REST API
- Reducing time to restore
- Flexibility
- Release register – API hook to when changes are made
- Issue report form
- Summary
- URL
- User-ids
- How many users & location
- When started
- Anyone working on it already
- Anything else to add
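
The form fields above could be modelled roughly as follows. This is only a sketch of one possible shape, not Xero's actual form: the class name `IssueReport`, every field name, and the `to_chat_message` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class IssueReport:
    """Hypothetical shape of the issue report form; all names are illustrative."""
    summary: str
    url: Optional[str] = None
    user_ids: List[str] = field(default_factory=list)
    affected_users: Optional[int] = None      # "how many users"
    location: Optional[str] = None
    started_at: Optional[datetime] = None     # "when started"
    already_working: List[str] = field(default_factory=list)
    notes: str = ""                           # "anything else to add"

    def to_chat_message(self) -> str:
        """Render the report as plain text that a bot could post to a channel."""
        started = self.started_at.isoformat() if self.started_at else "unknown"
        count = self.affected_users if self.affected_users is not None else "unknown"
        lines = [
            f"Summary: {self.summary}",
            f"URL: {self.url or 'n/a'}",
            f"User IDs: {', '.join(self.user_ids) or 'n/a'}",
            f"Users affected: {count}",
            f"Location: {self.location or 'unknown'}",
            f"Started: {started}",
            f"Already working on it: {', '.join(self.already_working) or 'nobody yet'}",
        ]
        if self.notes:
            lines.append(f"Notes: {self.notes}")
        return "\n".join(lines)
```

Keeping every field except the summary optional reflects the idea that reporting should be easy: a reporter can submit what they know and responders fill in the rest.
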
- Chat Bot for incident
- Populates the form and pushes it to the production channel, creates a PagerDuty alert
- Creates a new Slack channel for the incident
- Can automatically update status page from chat and page senior managers
- Can create “status updates” which record things (e.g. “restarted server”), or “Yammer updates” which get pushed to the social media team
- Creates a task list automatically for the incident
- Page people from within chat
- At the end: gives the time the incident lasted, archives the channel
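
The incident lifecycle described above (open a channel, record status updates, page people, report the duration at the end) might be tracked with something like the sketch below. It is an in-memory stand-in under stated assumptions: the `Incident` class, the channel naming scheme, and the default task list are all hypothetical, and no real Slack or PagerDuty calls are made.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional, Tuple


def utc_now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Incident:
    """In-memory record of one incident; a real bot would call chat/paging APIs."""
    summary: str
    now: Callable[[], datetime] = utc_now   # injectable clock, handy for testing
    channel: str = ""
    opened_at: Optional[datetime] = None
    closed_at: Optional[datetime] = None
    updates: List[Tuple[datetime, str]] = field(default_factory=list)
    paged: List[str] = field(default_factory=list)
    tasks: List[str] = field(default_factory=list)

    def start(self) -> None:
        self.opened_at = self.now()
        # Hypothetical channel naming scheme.
        self.channel = "incident-" + self.opened_at.strftime("%Y%m%d-%H%M")
        # Hypothetical default task list created for every incident.
        self.tasks = ["Assign incident lead", "Update status page", "Schedule post mortem"]

    def status_update(self, text: str) -> None:
        # Timestamped log entry, e.g. "restarted server".
        self.updates.append((self.now(), text))

    def page(self, person: str) -> None:
        # Placeholder: only records who would have been paged.
        self.paged.append(person)

    def close(self) -> timedelta:
        """Mark the incident closed and return how long it lasted."""
        if self.opened_at is None:
            raise RuntimeError("incident was never started")
        self.closed_at = self.now()
        return self.closed_at - self.opened_at
```

The timestamped `updates` list doubles as raw material for the post mortem timeline, which is one reason to record status updates through the bot rather than in free-form chat.
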
- Post Mortem
- More integrations
- Report card
- Change tracking
- Incident / Alert portal
- High Availability – dockerisation
- Caching
- PagerDuty
- AWS
- Datadog