DevOpsDaysNZ 2018 – Day 2 – Session 4

Allen Geer, Amanda Baker – Continuously Testing govt.nz

  • Various .govt.nz sites
  • All Silverstripe and Common Web Platform
  • Many sites out of date, no automated testing, no test metrics, manual testing
  • Micro-waterfall agile
  • Specification by example (product owner, DevOps, QA) created Gherkin tests (see the sketch after this list)
  • Standardised on CircleCI
  • Visualised – Spec by example
  • Prioritised feature tests
  • Gherkinise
  • Test at start of dev process. Bake Quality in at the start
  • Visualise and display metrics, people could then improve.
  • Path to automation isn’t binary
  • Involve everyone in the team
  • Automation only works if humanised
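
Not from the talk, just my own minimal sketch of what the specification-by-example flow could look like, using Python’s behave library as the Gherkin runner (the talk only mentioned Gherkin and CircleCI; the feature text and page helpers below are hypothetical):

```python
# features/search.feature (Gherkin, written with the product owner and QA):
#   Feature: Site search
#     Scenario: A citizen finds the passport renewal page
#       Given the search page is open
#       When I search for "renew passport"
#       Then the results include "Renew a passport"

# features/steps/search_steps.py -- the step definitions behind the Gherkin.
from behave import given, when, then


@given('the search page is open')
def step_open_search(context):
    context.page = context.browser.open("/search")    # hypothetical page helper


@when('I search for "{term}"')
def step_search(context, term):
    context.results = context.page.search(term)       # hypothetical page helper


@then('the results include "{title}"')
def step_check_results(context, title):
    assert any(title in result.title for result in context.results)
```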

Jules Clements – Configuration Pipeline : Ruling the One Ring

  • Desired state
  • I didn’t quite understand what he was saying

Nigel Charman – Keep Calm and Carry On Organising

  • 71 Conferences worldwide this year
  • NZ following the rules
  • Lots of help from people
  • Stuff stuff stuff

Jessica DeVita – Retrospecting our Retrospectives

  • Works on Azure DevOps
  • Post-mortems
  • What does it mean to have robust systems and resilience? Is resilience even a property? It just is. When we fly on planes, we’re trusting machines and automation. Even planes require regular reboots to avoid catastrophic failures, and we just trust that it happens
  • CEO after a million dollar outage said “Can you get me a million dollars of learning out of this?”
  • After the US Navy had accidents caused by sleep deprivation, it switched to a new watch structure
  • Postmortems are not magic, they don’t automatically make things change
  • http://stella.report
  • We dedicate a lot of time to below-the-line work, looking at the technology. Not a lot of conversation about above-the-line things like mental models.
  • Resilience is above the line
  • Catching the Apache SNAFU
  • The Ironies of Automation – Lisanne Bainbridge
  • Well facilitated debriefings support recalibration of mental models
  • US Forest Service – Learning Review – Blame discourages people speaking up about problems
  • We never know where the accident boundary is, only when we have crossed it.
    • SRE, Chaos Engineering and Human Factors help handle this
  • In postmortems please be mindful of judging timelines without context. Saying something happened in a short or long period of time is damaging
  • Ask “What made it hard to get that team on the phone?” and “What were you trying to achieve?”
  • Etsy Debriefing Guide – lots of important questions.
  • “Moving Past Shallow Incident Data” – Adaptive Capacity Labs
  • Safety is a characteristic of systems and not of their components
  • Ask people about their history, ask every person about what they do and how they got there because that is what shapes your culture as an organisation

DevOpsDaysNZ 2018 – Day 2 – Session 3

Kubernetes

I’ll fill this in later.

Observability

  • Honeycomb, Sumo Logic. Use AI to look at what happened at the same time and magically correlate
  • Expensive or hard to send all logs as volumes go up
  • What if the logging is wrong or missing?
  • Metrics
    • Export in prometheus format
    • Read the RED and USE papers
    • Create a company schema with half a dozen metrics that all services expose
  • Have an event or transaction ID that flows across all the microservices so logs can be correlated (see the sketch after this list)
  • Non technical solutions
    • Refer to previous incident logs
    • Part of deliverables for product is SLA stats which require logs etc
  • Testing logs
    • Make sure certain events produce a log
  • Chaos Monkey
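
A minimal sketch (mine, not from the session) of two of the ideas above: exposing a small shared set of RED-style metrics in Prometheus format via prometheus_client, and stamping every log line with a transaction ID so requests can be correlated across microservices. Service names and the port are made up:

```python
import logging
import uuid
from contextvars import ContextVar

from prometheus_client import Counter, Histogram, start_http_server

# The "company schema": every service exposes the same few metrics.
REQUESTS = Counter("app_requests_total", "Requests handled", ["service", "status"])
LATENCY = Histogram("app_request_seconds", "Request latency", ["service"])

transaction_id: ContextVar[str] = ContextVar("transaction_id", default="-")


class TransactionFilter(logging.Filter):
    """Attach the current transaction ID to every log record."""
    def filter(self, record):
        record.txid = transaction_id.get()
        return True


logging.basicConfig(format="%(asctime)s txid=%(txid)s %(message)s", level=logging.INFO)
for handler in logging.getLogger().handlers:
    handler.addFilter(TransactionFilter())
log = logging.getLogger("orders")


def handle_request(incoming_txid=None):
    # Reuse the upstream ID if one arrived (e.g. in a header), else mint one.
    txid = incoming_txid or str(uuid.uuid4())
    transaction_id.set(txid)
    with LATENCY.labels(service="orders").time():
        log.info("processing order")                  # txid added automatically
        REQUESTS.labels(service="orders", status="ok").inc()
    return txid                                       # pass downstream so logs line up


if __name__ == "__main__":
    start_http_server(8000)                           # serves /metrics for Prometheus
    handle_request()
```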

ANZ Drivetrain

  • Change control cares about
    • Availability
    • Risk
    • Dependencies
    • Rollback
  • But the team doing the change knows about all of these
  • Saw tools out there that seem very opinionated
  • Drivetrain
    • Automated Checklist
    • Work with Change people to create checklist
    • Pipeline talks to Drivetrain and tells it what has been done (see the sketch after this list)
    • Slack messages sent for manual changes (they log in to the app to approve)
  • Looked at some other tools (eg chef automate, udeploy )
    • Forced team to work in a certain pattern
  • But use ServiceNow tool as official corporate standard
    • Looking at making Drivetrain fill in ServiceNow forms
  • People worried about stages in the tool often didn’t realise the existing process had the same limitations
  • Risk assessed at the Story and Feature level. Not release level
  • Not suitable for products that do huge releases every few months with a massive number of changes.
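
A hypothetical sketch of the “lightweight rules engine” idea as I understood it: the pipeline reports what has been done for a change, and the checklist decides whether it can move to the next stage. The stage names and checks are illustrative, not ANZ’s actual Drivetrain:

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRecord:
    change_id: str
    completed_steps: set = field(default_factory=set)   # reported by the pipeline
    approvals: set = field(default_factory=set)         # e.g. collected via Slack


# Checklist agreed with the change-management people (illustrative only).
CHECKLIST = {
    "deploy-to-prod": {
        "required_steps": {"unit-tests", "integration-tests", "rollback-plan"},
        "required_approvals": {"change-manager"},
    },
}


def may_proceed(change: ChangeRecord, stage: str) -> tuple[bool, list[str]]:
    """Return (allowed, unmet requirements) for the given stage."""
    rules = CHECKLIST[stage]
    missing = [f"step:{s}" for s in rules["required_steps"] - change.completed_steps]
    missing += [f"approval:{a}" for a in rules["required_approvals"] - change.approvals]
    return (not missing, missing)


change = ChangeRecord("CHG-1234", {"unit-tests", "integration-tests"})
ok, missing = may_proceed(change, "deploy-to-prod")
print(ok, missing)   # False, with the rollback plan and approval listed as missing
```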


DevOpsDaysNZ 2018 – Day 2 – Session 2

Interesting article I read today

Why Doctors Hate their Computers by Atul Gawande

Mrinal Mukherjee – A DevOps Confessional

  • Not about accidents, it is about Planned Blunders that people are doing in DevOps
  • One Track DevOps
    • From Infrastructure background
    • Job going into places, automated the low hanging fruit, easy wins
    • Column of tools on resume
    • Started becoming the bottleneck, his team was the only one who knew how the infrastructure worked.
    • Not able to “DevOps” a company since only able to fix the infrastructure, not able to fix testing etc, so not delivering everything the company expected
    • If you are the only person who understands the infrastructure you are the only one blamed when it goes wrong
    • Fixes
      • Need to take all team on a journey
      • But need to have right expectations set
      • Need to do learning in areas where you have gaps
      • DevOps is not about individual glory, Devops is about delivering value
      • HR needs to make sure they don’t reward the wrong thing
  • MVP-Driven Devops
    • Mostly working on Greenfields products that need to be delivered quickly
    • MVP = Maximum Technical Debt
    • MVP = Delays later and Security audits = Your name attached to the problem
    • Minimum Standard of Engineering
      • Test cases, Documentation, Modular
      • Peer review
    • Evolve architecture, not re-architect
  • Judgemental Devops
    • That team sucks, they are holding things up, playing a different game from us
    • Laughing at other teams
    • Consequence – Stubbornness from the other team
    • Empathy
      • Find out why things are the way they are
    • Collaborate to find common ground and improve
    • Design your system so it works within the constraints of the other team

DevOpsDaysNZ 2018 – Day 2 – Session 1

Alison Polton-Simon – The DevOps Experiments: Reflections From a Scaling Startup

  • Software engineer at Angaza, previously Thoughtworks, “Organizational Anthropologist”
  • Angaza
    • Enable sales of life-changing products (eg solar chargers, water pumps, cook stoves in 3rd world countries)
    • Originally did hardware, changed to software company doing pay-as-you-go of existing devices
    • ~50 people. Kenya and SF, 50% engineering
    • No dedicated Ops
    • Innovate with empathy, Maximise impact
    • Model is to provide software tools to activate/deactivate products sold to people with low credit scores, plus our software around the activity like reports for distributors.
  • Reliability
    • Platform is business critical
    • Outages disrupt real people (households without light, farmers without irrigation)
    • Buildkite, Grafana, Zendesk
  • Constraints
    • Operate in 30+ countries
    • Low connectivity, 2G networks best case
    • Rural and peri-urban areas
    • Team growing by 50% in 2018 (2 eng teams in Kenya + 1 QA)
    • Most customers in timezone where day = SF night
  • Eras of experimentation
    • Ad Hoc
    • Tributes (sacrifice yourself for the sake of the team)
    • Collectives (multiple teams)
    • Product teams
  • Ad Hoc – 5 eng
    • 1 eng team
    • Ops by day – you broke, you fix
    • Ops by night – Pagerduty rotation
    • Paged on all backend exceptions, 3 pages = amnesty
    • Good
      • Small but senior
      • JIT maturity
      • Everyone sitting next to each other
    • Bad
      • Each incident highly disruptive
      • prioritized necessity over sustainability
  • Tribute – 5-12 eng
    • One person protecting team from interruptions
    • Introduced support person and triage
    • Expanded PD rotation
    • Good
      • More sustainable
      • Blue-Green deploys
      • Clustered workloads
    • Not
      • Headcount != horizontal scaling
      • Hard to hire
      • Customer service declined
  • Collectives 13-20 engs
    • Support and Ops teams – Ops staffed with devs
    • Other teams build roadmaps and requests
    • Teams rotate quarterly – helps onboarding
    • Good
      • Cross train ppl
      • Allow for focus
      • allowed ppl to get depth
    • Bad
      • Teams don’t operate what they built
      • Quarter flies by quickly
      • Context switch is costly
      • Still a juggling act
      • 1m ramping up, 6w going okay, 2w winding down
  • Product teams – 21–? eng
    • 5 teams, 2 in Nairobi
    • Teams aligned with business verticals, KPIs
    • Dev, own and maintain services
    • Per-team tributes
    • No [Dev]Ops team
    • Intended goals
      • Independent teams
      • Own what they build
      • Support biz KPIs
      • cross team coordination
    • Expected Challenges
      • Ownership without responsibility
      • Global knowledge sharing
      • Return to tribute system (2w out of the workflow)
  • Next
    • Keep growing team
    • Working groups
    • Eventual SRE
    • 24h global coverage
  • Has a “Constitution” of values that everybody who is hired signs
  • Takeaways
    • Maximise impact
      • Dependable tools over fashionable ones
      • Prefer industry-std tech
      • But get creative when necessary
    • Define what reliability means for your system
    • Evolve with Empathy
      • Don’t be dogmatic without structure
      • Serve your customers and your team
      • Adapt when necessary
      • Talk to people
      • Be explicit as to the tradeoffs you are making

Anthony Borton – Four lessons learnt from Microsoft’s DevOps Transformation

  • Microsoft started in 1975
  • 93k odd engineers at Microsoft
    • 78k deployments per day
    • 2m commits per month
    • 4.4 builds/month
    • 500 million tests/day
  • 2018 State of Devops reports looks at Elite performers in the space
  • TFS – Team Foundation Server
    • Move product to the cloud
    • Moved on-prem to one instance
    • Each account had its own DB (broke stuff at 11k DBs)
  • 4 lessons
    • Customer focussed
      • Listen to customers, uservoice.com
      • Lots of team members keep eye on it
      • Stackoverflow
      • Embed with customers
      • Feedback inside product
      • Have to listen in all the channels
    • Failure is an opportunity to learn
    • Definition of done
      • Live in prod, collecting telemetry that examines the hypotheses it was created to prove
    • “For those of you who don’t know who Encarta is, look it up on Wikipedia”
  • Team Structure
    • Combined engineering = devs + testing
      • Some testers left team or organisation
    • Feature team
      • Physical team rooms
      • Cross discipline
      • 10-12 people
      • self managing
      • Clear charter and goals
      • Intact for 12-18 months
    • Sticky note exercise, people pick which teams they would like to join (first 3 choices)
      • 20% choose to change
      • 100% get the choice
  • New constraints and requirements
    • Problems
      • Tests took too long – 22h to 2 days
      • Tests failed frequently – only ~60% passed 100% of the time
      • Quality signal unreliable in master
    • Publish VSTS Quality vision
      • Sorted by external dependencies
      • Unit tests
        • L0 – in-memory unit tests
        • L1 – More with SQL
      • Functional Tests
        • L2 – Functional tests against testable service deployment
        • L3 – Restricted class integration tests that run against production
      • 83k L0 tests run against all pull requests, very fast
  • Deploy to rings of users
    • Ring 0 – Internal Only
    • Ring 1 – Small Datacentre 1-1.5m accounts in Brazil (same TZ as US)
    • Ring 2 – Public accounts, medium-large data centre
    • Ring 3 – Large internal accounts
    • Ring 4-5 – everyone else
    • Takes about a week for normal releases.
    • Binaries go out and then the database changes
    • Delays of minutes (up to 75) during the deploys to allow bugs to manifest (see the rollout sketch after this list)
    • Some customers have access to feature flags
    • Customers who are risk tolerant can opt in to early deploys. Allows them to get faster feedback from people who are able to provide it
  • More features delivered in 2016 than in the previous 4 years; 50% more in 2017
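
A hypothetical sketch of the ring-based rollout described above: each ring only gets the build after the previous ring has had it for a bake time and still looks healthy. Purely an illustration, not Microsoft’s actual tooling:

```python
import time

RINGS = [
    {"name": "ring0-internal", "bake_minutes": 75},
    {"name": "ring1-brazil", "bake_minutes": 75},
    {"name": "ring2-public-medium", "bake_minutes": 75},
    {"name": "ring3-large-internal", "bake_minutes": 75},
    {"name": "ring4-5-everyone-else", "bake_minutes": 0},
]


def healthy(ring_name: str) -> bool:
    """Placeholder health check -- a real pipeline would query telemetry."""
    return True


def deploy(build: str, ring_name: str) -> None:
    print(f"deploying {build} to {ring_name}")       # stand-in for the real deployment


def roll_out(build: str) -> None:
    for ring in RINGS:
        deploy(build, ring["name"])
        time.sleep(ring["bake_minutes"] * 60)        # let bugs manifest before widening
        if not healthy(ring["name"]):
            raise RuntimeError(f"halting rollout: {ring['name']} is unhealthy")


if __name__ == "__main__":
    roll_out("build-2018.10.1")
```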


DevOpsDaysNZ 2018 – Day 1 – Session 4

Everett Toews – A Trip Down CI/CD Lane

I missed most of this talk. Sounded Good.

Jeff Smith – Creating Shared Contexts

  • Ideas and viewpoints differ from person to person
  • Happens in organisation, need to make sure everybody is on the same page
  • Build a shared context via conversations
  • Info exchange
  • Communications tools
  • Context Tools
  • X/Y Problem
  • Data can bridge conversations. Shared reality.
  • Use the same tools as other teams so you know what they are doing
  • Give context with your requests, and ask for context in others’ requests; eventually it will happen automatically.

Peter Sellars – 2018: A Build Engineers Odyssey

  • Hungry, Humble and Smart

Katrina Clokie – Testing in DevOps for Engineers

  • We can already write, so how hard can it be to write a novel?
  • Hopefully some of you are doing testing already
  • Problem is that people overestimate their testing skills and are not interested in finding out anything else.
  • The testing you are doing now might be with one tool, in one spot. You are probably finding stuff but missing other things
  • Why important
    • Testing is part of your role
    • In DevOps testing goes through Operations as well
    • Testing in DevOps is like air, it is all around you in every role
    • Role of testers is to teach people to breathe continuously and naturally.
  • Change the questions that you ask
    • How do you know that something is okay? What questions are you asking of your product?
    • Oracles are the ways that we recognise a problem
    • Devs ask: “Does it work how we think it should?”
    • Ops ask: “Does it work how it usually works?”
    • Devs on claims, Ops on history
    • Does it work like our competitors’? Does it meet its purpose without harmful side effects? Does it meet legal requirements? Does it work like similar services?
    • HICCUPPS – Testing without a Map – Michael Bolton, 2005
    • How do we compare to what other people are doing? (Not just a BA’s job, because the customer will be asking these questions and so should you)
    • Flip the Oracle, compare them against other things not just the usual.
    • Audit – continuous compliance. Always think about whether it works like the standards say it should.
    • These are things that the business is asking. If you ask them, you gain the confidence of the business
  • Look for Answers in Other Places
    • Number of tests: UI < Service < Unit
    • The Test Pyramid as a bug catcher. Catch the Simple bugs first and then the subtle ones
  • Testing mesh
    • Unit tests – fine mesh
    • Integration – bigger/fewer tests but cover more
    • Next few layers: End to End, Alerting, Monitoring, Logging. Each stops different types of bugs
    • Conversation should be “Where do we put our mesh?”, “How far can this bug fall?”.
    • If another layer will pick the bug up, do we need a test?
  • Use Monitoring as testing
    • Push risk really late; not in all cases but it can often work (see the sketch after this list)
  • A/B testing
    • Ops needs to monitor
    • Dev needs a framework to roll out and put in different options
  • Chaos Engineering
    • Start with something small, communicate well and do during daylight hours.
    • Your customers are testing in production all the time, so why aren’t you?
  • https://leanpub.com/testingindevops
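
A hypothetical sketch of “monitoring as testing”: the same assertion that would sit in an end-to-end test runs continuously against production, checking both the dev oracle (does it work how we think it should?) and the ops oracle (does it work how it usually works?). The endpoint and thresholds are made up:

```python
import statistics
import time
import urllib.request

HISTORY: list[float] = []            # recent response times: "how it usually works"


def alert(message: str) -> None:
    print("ALERT:", message)         # stand-in for a real pager or chat notification


def check_search(url: str = "https://example.org/search?q=passport") -> None:
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        status, body = resp.status, resp.read()
    elapsed = time.monotonic() - start

    # Dev oracle: does it work how we think it should?
    assert status == 200 and b"passport" in body.lower()

    # Ops oracle: does it work how it usually works?
    if len(HISTORY) >= 20 and elapsed > 3 * statistics.mean(HISTORY):
        alert(f"search is 3x slower than usual ({elapsed:.2f}s)")

    HISTORY.append(elapsed)
    del HISTORY[:-100]               # keep a rolling window of the last 100 checks
```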


DevOpsDaysNZ 2018 – Day 1 – Session 3

Open Space 1 – Prod Support, who’s responsible

  • Problem that Ops doesn’t know products, devs can’t fix, product support owners not technical enough
  • Xero have embedded Ops and dev in teams. Each person oncall maybe 2 weeks in 20
  • Customer support team does everything?
  • “Ops have big graphs on screens, BI have a couple of BI stats on screens, Devs have …youtube videos”
  • Tiered support vs Product team vs Product support team
  • Tiered support
    • Single point of entry
    • lower paid person can handle easy stuff
    • Context across multiple apps
  • Product Team
    • Buck stops with someone
    • More likely to be able to action
    • Ownership of issues
    • Everyone must be enabled to do stuff
    • Everyone needs to be upskilled
  • Prod Support
    • Big, skilled, can-fix-anything team
    • Devs not keen
    • Even the best teams don’t know everything

Open Space 2 – DevOps at NZ Scale

  • Devops team, 3rd silo
    • Sometimes they are the new team doing cool stuff
    • One model is evangelism team
  • Do you want devops culture or do you just want somebody to look after your pipeline?
  • Companies often don’t know what they want to hire
  • Companies get some benefit with the tools (pipelines, agile)  but not the culture. But to get the whole benefit they need to adopt everything.
  • The Way of Ways article by John Cutler

Open Space 3 – Responding Quickly

I was taking notes on the board.


DevOpsDaysNZ 2018 – Day 1 – Session 2

Mark Simpson, Carlie Osborne – Transforming the Bank: pipelines not paperwork

  • Change really is possible even in the least likely places
  • Big and risk averse
    • Lots of paperwork and process, very slow
  • Needed to change – In the beginning – 18 months ago
    • 6 months talking about how we could change things
  • Looked for a pilot project – Internet Banking team – ~80 people – Go-money platform
    • Big monolith, 1m lines of code
    • New release every 6 weeks
    • 10 weeks for feature from start to production
    • Release at midnight on a Friday night, 4-5 hours outage, 20-25 people.
    • Customer complaints about the outage at midnight, moved to 2am Sunday morning
  • Change to release every single week
    • Has to be middle of the day, no outage
    • How do we do this?
  • Took whole Internet banking team for 12 weeks to create process, did nothing else.
  • What we didn’t do
    • Didn’t replatform, no time
  • What we did
    • Jenkins – Created a single pipeline, from commit to master all the way to production
    • Got rid of selenium tests
    • Switched to cypress.io
      • Just tested 5 key customer journeys
    • Drivetrain – Internal App
      • Wanted to empower the teams, but lots of limits within industry/regulations
      • Centralise decision making
      • Lightweight Rules engine, checks that all the requirements have been done by the team before going to the next stage.
    • Canary Deployments
      • Two versions running, ability to switch users to one or the other (see the sketch after this list)
  • Learning to Break things down into small chunks
  • Change Process
    • Lots of random rules, eg mandatory stand-down times
    • New change process for teams using Drivetrain; certified the process, not each release
  • Lots of times spent talking to people
    • Had to get lots of signoffs
  • Result
    • Successful
    • 16 weeks rather than 12
    • 28 releases in less than 6 months (vs approx 4 previously)
    • 95% less toil for each release
  • Small not Big changes
    • Now takes just 4-5 weeks to cycle through a feature
    • Don’t like saying MVP. Pitch it as quickly delivering a bit of value and iterating
    • 2-week pilot, no iterations -> 8-week pilot, 4 iterations
    • Solution at start -> Solution derived over time
  • Sooner, not later
    • Previously
      • Risk, operations people not engaged until too late
      • Dev team disengaged from getting things into production
    • Now
      • Everybody engaged earlier
  • Other teams adopting similar approach
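
A hypothetical sketch of the canary idea above: two versions run side by side and a stable per-user rule decides who gets the new one, so a subset of users can be switched over (and back) with no outage. The routing rule is illustrative, not ANZ’s implementation:

```python
import hashlib

CANARY_PERCENT = 5                       # share of users sent to the new (green) version
FORCED_USERS = {"internal-tester-1"}     # e.g. staff explicitly opted in


def version_for(user_id: str) -> str:
    if user_id in FORCED_USERS:
        return "green"
    # Stable hash so the same user always lands on the same version.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "green" if bucket < CANARY_PERCENT else "blue"


print(version_for("customer-42"))        # "blue" or "green", stable for this user
```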

Ryan McCarvill – Fighting fires with DevOps

  • Lots of information coming into a firetruck, displayed on dashboard
  • Old system was 8-digit codes
  • Rugged server in each truck
    • UPS
    • Raspberry Pi
    • Storage
    • Lots of different networking
  • Requirements
    • Redundant Comms
    • Realtime
    • Offline Maps
    • Offline documentation, site reports, photos, building info
    • Offline Hazards
    • Allow firefighters to update
    • Track appliance and firefighter status
    • Be a hub for an incident
    • Needs to be very secure
  • Stack on the Truck
    • Ansible, git, Docker, .NET Core, Redis, 20 microservices
  • What happens if update fails?
  • More than 1000 trucks, might be offline for months at a time
  • How to keep secure
  • AND iterate quickly
  • Pipeline
    • Online update when truck is at home
    • Don’t update if moving
    • Blue/Green updates
    • Health probes
  • Visual Studio Team Services -> Azure Container Registry
  • Playbooks in git, ansible-pull (see the update sketch at the end of these notes)
  • Nginx in front of blue/green
  • Built – there were problems
    • Some overheating
    • Server in truck taken out of scope, lost offline strategy
    • No money or options to buy new solution
  • MVP requirements
    • Lots of gigs of data; made some of it online-only
    • But many gigs were still needed offline
    • Create virtual firetruck in the sky, worked for online
    • Still had communication device – 1 core, minimum storage, locked down Linux
  • Put a USB stick in the back of the device and updated it
    • Can’t use a lot of resources or it will impact comms
    • Hazard search
      • Java/Python app, too much impact on the system
      • Re-wrote in rust, low impact and worked
      • Changed push to rsync and bash
  • Lessons
    • Automation got us the flexibility to change
    • Automation gave us the flexibility to grow
    • Creativity can solve any problem
    • You can solve new problems with old technology
    • Sometimes the only way to get buy in is to just do it.
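
A hypothetical sketch of the on-truck update flow as I understood it: only update when the truck is home and stationary, bring the green stack up alongside blue, probe its health, and only then point nginx at it. The repo URL, playbook and paths are made up:

```python
import subprocess
import urllib.request


def truck_is_home_and_stationary() -> bool:
    return True                          # stand-in for a GPS / depot-network check


def healthy(port: int) -> bool:
    try:
        with urllib.request.urlopen(f"http://localhost:{port}/health", timeout=5) as r:
            return r.status == 200
    except OSError:
        return False


def update() -> None:
    if not truck_is_home_and_stationary():
        return                           # never update a moving truck

    # Pull the latest playbooks/images and start the green stack
    # (ansible-pull as mentioned in the talk; the repo and playbook are made up).
    subprocess.run(["ansible-pull", "-U", "https://git.example/truck.git",
                    "green-stack.yml"], check=True)

    if healthy(8081):                    # probe the green side
        # Point nginx at green and reload; nginx stays in front the whole time.
        subprocess.run(["ln", "-sf", "green.conf", "/etc/nginx/upstream.conf"], check=True)
        subprocess.run(["nginx", "-s", "reload"], check=True)
    else:
        # Roll back: tear the green stack down and leave blue serving.
        subprocess.run(["docker", "compose", "-f", "green.yml", "down"], check=True)


if __name__ == "__main__":
    update()
```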

DevOpsDaysNZ 2018 – Day 1 – Session 1

Jeff Smith – Moving from Ops to DevOps: Centro’s Journey to the Promiseland

  • Everyone’s transformation will look a little different
  • Tools are important but not the main problem (dev vs Ops)
  • Hiring a DevOps Manager
    • Just sounds like a normal manager
    • Change it to “Director of Production Operations”
  • A “DevOps Manager” is the 3rd silo on top of Dev and Ops
  • What did people say was wrong when he got there?
    • Paternalistic Ops view
      • Devs had no rights on instances
      • Devs no prod access
      • Devs could not create alerts
  • Fix to reduce Ops load
    • Devs get root on instances, with the ability to easily destroy and recreate them if they break something
    • Devs get access to common safe tasks, required automation and tools (which Ops also adopted)
    • Migrated to datadog – Single tool for all monitoring that anyone could access/update.
    • Shared info about infrastructure. Docs, lunch and learns. Pairing.
  • Expanding the scope of Ops
    • Included in the training and dev environment, CICD. Customers are internal and external
    • Used same code to build every environment
    • Offering Operation expertise
      • Don’t assume the people who write the software know the best way to run it
  • Behaviour can impact performance
    • See book “Turn the Ship around”
    • Participate in Developer rituals – Standups, Retros
    • Start with “Yes.. But” instead of “No” for requests. Assume you can but make it safe
    • Ask “Can you give me some context?” Don’t just do the request; get the full picture.
  • Metrics to Track
    • Planned vs unplanned work
    • What are you doing lots of times?
  • What did we talk about?
    • Don’t allow your ops dept to be a nanny
    • Remove nanny state but maintain operation safety
    • Monitor how your language impacts behaviour
    • Monitor and track the type of work you are doing

François Conil – Monitoring that cares (the end of user based monitoring)

  • User Based monitoring (when people who are affected let you know it is down)
  • Why are we not getting alerts?
    • We are not measuring the right thing
    • Just ignore the dashboard (always orange or red)
    • Just don’t understand the system
  • First Steps
    • Accept that things are not fine
    • Decide what you need to be measuring, who needs to know, etc. First principles
    • A little help goes a long way ( need a team with complementary strengths)
  • Actionable Alerts
    • Something Broken, User affected, I am the best person to fix, I need to fix immediately
    • Unless all 4 apply, nobody should be woken up (see the sketch after this list).
      • Measured: Take to QA or performance engineers to find out the baseline
      • User affected: If nobody is affected do we care? Do people even work nights? How do you gather feedback?
      • Best person to fix: Should an ops person who doesn’t understand it be the first person paged?
      • Does it need to be fixed immediately? – Backup environment, too much detail in the alerts; don’t alert on everything that is broken, just the thing causing the problem
  • Fix the cause of the alerts that are happening the most often
  • You need time to get things done
    • Talk to people
    • Find time for fixes
  • You need money to get things done
    • How much is the current situation costing the company?
    • Tech-Debt Friday
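
A minimal sketch of that four-question filter as code (the alert fields are hypothetical):

```python
from dataclasses import dataclass


@dataclass
class Alert:
    broken: bool
    users_affected: bool
    best_person_to_fix: bool
    must_fix_immediately: bool


def should_wake_someone(alert: Alert) -> bool:
    # Page a human only when all four criteria hold.
    return all([
        alert.broken,
        alert.users_affected,
        alert.best_person_to_fix,
        alert.must_fix_immediately,
    ])


# Backup job failed at 2am but nobody is affected until morning: log it, don't page.
print(should_wake_someone(Alert(True, False, True, False)))   # False
```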