Jeff Smith – Moving from Ops to DevOps: Centro’s Journey to the Promiseland
- Everyone’s transformation will look a little different
- Tools are important, but they are not the main problem (the Dev vs Ops divide is)
- Hiring a DevOps Manager
- The title just sounds like a normal manager
- Changed it to "Director of Production Operations"
- A “DevOps Manager” is the 3rd silo on top of Dev and Ops
- What did people say was wrong when he got there?
- Paternalistic Ops view
- Devs had no rights on instances
- Devs no prod access
- Devs could not create alerts
- Fix to reduce Ops load
- Devs get root on instances, plus the ability to easily destroy and recreate an instance if they break it
- Devs get access to common safe tasks; this required automation and tooling (which Ops also adopted)
- Migrated to Datadog: a single tool for all monitoring that anyone could access and update (metric sketch below)
- Shared information about the infrastructure: docs, lunch-and-learns, pairing
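The talk names Datadog as the single shared monitoring tool but does not show how metrics were submitted. Below is a minimal sketch, assuming the official `datadog` Python package and a locally running agent; the metric names and tags are purely illustrative.

```python
# Minimal sketch of emitting custom metrics to Datadog via DogStatsD.
# Assumes the Datadog agent is listening locally on the default port and
# the `datadog` Python package is installed; metric names/tags are made up.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Anyone (dev or ops) can emit the same metrics from their own code.
statsd.increment("app.jobs.processed", tags=["env:staging", "team:ads"])
statsd.gauge("app.queue.depth", 42, tags=["env:staging"])
```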
- Expanding the scope of Ops
- Ops included in the training and dev environments and in CI/CD; customers are both internal and external
- Used the same code to build every environment (sketch below)
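The talk does not say which provisioning tool was used to build every environment from the same code. The sketch below assumes Terraform driven by a small Python wrapper; the environment names, var-file layout, and function name are hypothetical.

```python
# Sketch of "one codebase builds every environment": only the environment
# name changes, the build steps are shared. Names here are hypothetical.
import subprocess
import sys

ENVIRONMENTS = {"dev", "training", "staging", "prod"}

def build_environment(env: str) -> None:
    if env not in ENVIRONMENTS:
        raise SystemExit(f"unknown environment: {env}")
    # Same provisioning code path for every environment; only the
    # variable file differs (instance sizes, DNS names, etc.).
    subprocess.run(
        ["terraform", "apply", f"-var-file={env}.tfvars", "-auto-approve"],
        check=True,
    )

if __name__ == "__main__":
    build_environment(sys.argv[1])
```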
- Offering Operation expertise
- Don’t assume the people who write the software know the best way to run it
- Behaviour can impact performance
- See the book "Turn the Ship Around!" (L. David Marquet)
- Participate in Developer rituals – Standups, Retros
- Start with "Yes, but…" instead of "No" for requests; assume you can do it, then make it safe
- Ask "Can you give me some context?" Don't just do the request; get the full picture
- Metrics to Track
- Planned vs unplanned work (counting sketch after this list)
- What are you doing over and over again?
- What do we spend our time talking about?
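One way to track planned vs unplanned work is to count it from a ticket export. The sketch below is an illustration only; the CSV file name and the `type` column (with values "planned"/"unplanned") are assumptions, not something the talk prescribed.

```python
# Sketch: measure planned vs unplanned work from a ticket export (CSV).
# File name and column names are assumptions for illustration.
import csv
from collections import Counter

def planned_vs_unplanned(path: str) -> None:
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["type"].strip().lower()] += 1
    total = sum(counts.values()) or 1
    for kind in ("planned", "unplanned"):
        print(f"{kind}: {counts[kind]} ({100 * counts[kind] / total:.0f}%)")

planned_vs_unplanned("tickets.csv")
```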
- Don’t allow your ops dept to be a nanny
- Remove nanny state but maintain operation safety
- Monitor how your language impacts behaviour
- Monitor and track the type of work you are doing
François Conil – Monitoring that cares (the end of user-based monitoring)
- User-based monitoring (when the people who are affected are the ones who let you know it is down)
- Why are we not getting alerts?
- We are not measuring the right things
- We just ignore the dashboard (it is always orange or red)
- We just don't understand the system
- First Steps
- Accept that things are not fine
- Decide from first principles what you need to be measuring, who needs to know, etc.
- A little help goes a long way (you need a team with complementary strengths)
- Actionable Alerts
- Something is broken, a user is affected, I am the best person to fix it, and it needs fixing immediately
- Unless all four apply, nobody should be woken up (decision sketch after this list)
- Something broken (measured): talk to QA or performance engineers to find out the baseline
- User affected: If nobody is affected do we care? Do people even work nights? How do you gather feedback?
- Best person to fix: should an ops person who doesn't understand the system be the first one paged?
- Does it need fixing immediately? Is there a backup environment? Avoid too much detail in the alerts; don't alert on everything that is broken, just on the thing causing the problem
- Fix the cause of the alerts that are happening the most often
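A small sketch of the four-question paging gate above, plus a frequency count to find the noisiest alerts so their causes get fixed first. The `Alert` fields and names are illustrative and not tied to any particular alerting tool.

```python
# Sketch of the "four questions" paging gate, plus a count of which alerts
# fire most often so the noisiest causes can be fixed first.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    something_broken: bool   # measured against a known baseline
    user_affected: bool      # is anyone actually impacted right now?
    i_am_best_person: bool   # am I the right responder?
    needs_fix_now: bool      # or can it wait for working hours?

def should_page(a: Alert) -> bool:
    # Only wake someone up when all four criteria hold.
    return all((a.something_broken, a.user_affected,
                a.i_am_best_person, a.needs_fix_now))

def noisiest_alerts(history: list[Alert], top: int = 5):
    # The alerts that fire most often point at the causes to fix first.
    return Counter(a.name for a in history).most_common(top)
```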
- You need time to get things done
- Talk to people
- Find time for fixes
- You need money to get things done
- How much is the current situation costing the company? (rough calculation sketch below)
- Tech-Debt Friday
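One way to answer "what is this costing us" is a back-of-the-envelope calculation like the one below. Every figure in it is a hypothetical placeholder to be replaced with your own numbers.

```python
# Back-of-the-envelope: what is noisy, user-reported monitoring costing us?
# All numbers below are hypothetical placeholders.
night_pages_per_week = 10     # assumed
hours_lost_per_page = 1.5     # lost sleep + context switching, assumed
loaded_hourly_cost = 75       # fully loaded engineer cost, assumed
weeks_per_year = 48

annual_cost = (night_pages_per_week * hours_lost_per_page
               * loaded_hourly_cost * weeks_per_year)
print(f"~${annual_cost:,.0f} per year in interrupted engineer time alone")
```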