Jeff Smith – Moving from Ops to DevOps: Centro’s Journey to the Promiseland
- Everyone’s transformation will look a little different
- Tools are important, but they are not the main problem (the Dev vs Ops divide is)
- Hiring a DevOps Manager
- The title just sounds like a normal manager
- Changed it to "Director of Production Operations"
- A “DevOps Manager” is the 3rd silo on top of Dev and Ops
- What did people say was wrong when he got there?
- Paternalistic Ops view
- Devs had no rights on instances
- Devs no prod access
- Devs could not create alerts
- Fix to reduce Ops load
- Devs get root on instances, plus the ability to easily destroy and recreate an instance if they break it
- Devs get access to common safe tasks; this required automation and tooling (which Ops also adopted)
- Migrated to Datadog: a single tool for all monitoring that anyone could access and update (metric sketch below)
- Shared information about the infrastructure: docs, lunch-and-learns, pairing
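The talk names Datadog as the single shared monitoring tool but does not show how metrics were submitted. Below is a minimal sketch, assuming the official `datadog` Python package and a locally running agent; the metric names and tags are purely illustrative.

```python
# Minimal sketch of emitting custom metrics to Datadog via DogStatsD.
# Assumes the Datadog agent is listening locally on the default port and
# the `datadog` Python package is installed; metric names/tags are made up.
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

# Anyone (dev or ops) can emit the same metrics from their own code.
statsd.increment("app.jobs.processed", tags=["env:staging", "team:ads"])
statsd.gauge("app.queue.depth", 42, tags=["env:staging"])
```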
- Expanding the scope of Ops
- Ops included in the training and dev environments and in CI/CD; customers are both internal and external
- Used the same code to build every environment (sketch below)
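The talk does not say which provisioning tool was used to build every environment from the same code. The sketch below assumes Terraform driven by a small Python wrapper; the environment names, var-file layout, and function name are hypothetical.

```python
# Sketch of "one codebase builds every environment": only the environment
# name changes, the build steps are shared. Names here are hypothetical.
import subprocess
import sys

ENVIRONMENTS = {"dev", "training", "staging", "prod"}

def build_environment(env: str) -> None:
    if env not in ENVIRONMENTS:
        raise SystemExit(f"unknown environment: {env}")
    # Same provisioning code path for every environment; only the
    # variable file differs (instance sizes, DNS names, etc.).
    subprocess.run(
        ["terraform", "apply", f"-var-file={env}.tfvars", "-auto-approve"],
        check=True,
    )

if __name__ == "__main__":
    build_environment(sys.argv[1])
```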
- Offering Operation expertise
- Don’t assume the people who write the software know the best way to run it
- Behaviour can impact performance
- See the book "Turn the Ship Around!" (L. David Marquet)
- Participate in Developer rituals – Standups, Retros
- Start with "Yes, but…" instead of "No" for requests; assume you can do it, then make it safe
- Ask "Can you give me some context?" Don't just do the request; get the full picture
- Metrics to Track
- Planned vs unplanned work (counting sketch after this list)
- What are you doing over and over again?
- What do we spend our time talking about?
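One way to track planned vs unplanned work is to count it from a ticket export. The sketch below is an illustration only; the CSV file name and the `type` column (with values "planned"/"unplanned") are assumptions, not something the talk prescribed.

```python
# Sketch: measure planned vs unplanned work from a ticket export (CSV).
# File name and column names are assumptions for illustration.
import csv
from collections import Counter

def planned_vs_unplanned(path: str) -> None:
    counts = Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            counts[row["type"].strip().lower()] += 1
    total = sum(counts.values()) or 1
    for kind in ("planned", "unplanned"):
        print(f"{kind}: {counts[kind]} ({100 * counts[kind] / total:.0f}%)")

planned_vs_unplanned("tickets.csv")
```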
- Don’t allow your ops dept to be a nanny
- Remove nanny state but maintain operation safety
- Monitor how your language impacts behaviour
- Monitor and track the type of work you are doing
François Conil – Monitoring that cares (the end of user-based monitoring)
- User-based monitoring (when the people who are affected are the ones who let you know it is down)
- Why are we not getting alerts?
- We are not measuring the right things
- We just ignore the dashboard (it is always orange or red)
- We just don't understand the system
- First Steps
- Accept that things are not fine
- Decide from first principles what you need to be measuring, who needs to know, etc.
- A little help goes a long way (you need a team with complementary strengths)
- Actionable Alerts
- Something is broken, a user is affected, I am the best person to fix it, and it needs fixing immediately
- Unless all four apply, nobody should be woken up (decision sketch after this list)
- Something broken (measured): talk to QA or performance engineers to find out the baseline
- User affected: If nobody is affected do we care? Do people even work nights? How do you gather feedback?
- Best person to fix: should an ops person who doesn't understand the system be the first one paged?
- Does it need fixing immediately? Is there a backup environment? Avoid too much detail in the alerts; don't alert on everything that is broken, just on the thing causing the problem
- Fix the cause of the alerts that are happening the most often
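A small sketch of the four-question paging gate above, plus a frequency count to find the noisiest alerts so their causes get fixed first. The `Alert` fields and names are illustrative and not tied to any particular alerting tool.

```python
# Sketch of the "four questions" paging gate, plus a count of which alerts
# fire most often so the noisiest causes can be fixed first.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    something_broken: bool   # measured against a known baseline
    user_affected: bool      # is anyone actually impacted right now?
    i_am_best_person: bool   # am I the right responder?
    needs_fix_now: bool      # or can it wait for working hours?

def should_page(a: Alert) -> bool:
    # Only wake someone up when all four criteria hold.
    return all((a.something_broken, a.user_affected,
                a.i_am_best_person, a.needs_fix_now))

def noisiest_alerts(history: list[Alert], top: int = 5):
    # The alerts that fire most often point at the causes to fix first.
    return Counter(a.name for a in history).most_common(top)
```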
- You need time to get things done
- Talk to people
- Find time for fixes
- You need money to get things done
- How much is the current situation costing the company? (rough calculation sketch below)
- Tech-Debt Friday
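One way to answer "what is this costing us" is a back-of-the-envelope calculation like the one below. Every figure in it is a hypothetical placeholder to be replaced with your own numbers.

```python
# Back-of-the-envelope: what is noisy, user-reported monitoring costing us?
# All numbers below are hypothetical placeholders.
night_pages_per_week = 10     # assumed
hours_lost_per_page = 1.5     # lost sleep + context switching, assumed
loaded_hourly_cost = 75       # fully loaded engineer cost, assumed
weeks_per_year = 48

annual_cost = (night_pages_per_week * hours_lost_per_page
               * loaded_hourly_cost * weeks_per_year)
print(f"~${annual_cost:,.0f} per year in interrupted engineer time alone")
```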