Alison Polton-Simon – The DevOps Experiments: Reflections From a Scaling Startup
- Software engineer at Angaza, previously Thoughtworks, “Organizational Anthropologist”
- Angaza
- Enable sales of life-changing products (e.g. solar chargers, water pumps, cook stoves in developing countries)
- Originally did hardware, changed to software company doing pay-as-you-go of existing devices
- ~50 people. Kenya and SF, 50% engineering
- No dedicated Ops
- Innovate with empathy, Maximise impact
- Model is to provide software tools to activate/deactivate products sold to people with low credit scores, plus other software around that activity, such as reports for distributors.
- Reliability
- Platform is business critical
- Outages disrupt real people (households without light, farmers without irrigation)
- Buildkite, Grafana, Zendesk
- Constraints
- Operate in 30+ countries
- Low connectivity, 2G networks best case
- Rural and peri-urban areas
- Team growing by 50% in 2018 (2 eng teams in Kenya + 1 QA)
- Most customers in timezone where day = SF night
- Eras of experimentation
- Ad Hoc
- Tributes (sacrifice yourself for the sake of the team)
- Collectives (multiple teams)
- Product teams
- Ad Hoc – 5 eng
- 1 eng team
- Ops by day – you broke, you fix
- Ops by night – PagerDuty rotation
- Paged on all backend exceptions, 3 pages = amnesty
- Good
- Small but senior
- JIT maturity
- Everyone sitting next to each other
- Bad
- Each incident highly disruptive
- Prioritized necessity over sustainability
- Tribute – 5-12 eng
- One person protecting team from interruptions
- Introduced support person and triage
- Expanded PD rotation
- Good
- More sustainable
- Blue-Green deploys
- Clustered workloads
- Bad
- Headcount != horizontal scaling
- Hard to hire
- Customer service declined
- Collectives 13-20 engs
- Support and Ops teams – Ops staffed with devs
- Other teams build roadmaps and requests
- Teams rotate quarterly – helps onboarding
- Good
- Cross train ppl
- Allow for focus
- Allowed ppl to get depth
- Bad
- Teams don’t operate what they built
- Quarter flies by quickly
- Context switch is costly
- Still a juggling act
- 1m ramping up, 6w going okay, 2w winding down
- Product teams 21 -? eng
- 5 teams, 2 in Nairobi
- Teams aligned with business verticals, KPIs
- Dev, own and maintain services
- Per-team tributes
- No [Dev]Ops team
- Intended goals
- Independent teams
- Own what they build
- Support biz KPIs
- cross team coordination
- Expected Challenges
- Ownership without responsibility
- Global knowledge sharing
- Return to tribute system (2w out of the workflow)
- Next
- Keep growing team
- Working groups
- Eventual SRE
- 24h global coverage
- Create a “Constitution” of values that everybody who is hired signs
- Takeaways
- Maximise impact
- Dependable tools over fashionable ones
- Prefer industry-std tech
- But get creative when necessary
- Define what reliability means for your system
- Evolve with Empathy
- Don’t be dogmatic without structure
- Serve your customers and your team
- Adapt when necessary
- Talk to people
- Be explicit as to the tradeoffs you are making
Anthony Borton – Four lessons learnt from Microsoft’s DevOps Transformation
- Microsoft started in 1975
- 93k odd engineers at Microsoft
- 78k deployments per day
- 2m commits per month
- 4.4m builds/month
- 500 million tests/day
- The 2018 State of DevOps report looks at Elite performers in the space
- TFS – Team Foundation Server
- Move product to the cloud
- Moved on-prem to one instance
- Each account had its own DB (broke stuff at 11k DBs)
- 4 lessons
- Customer focussed
- Listen to customers, uservoice.com
- Lots of team members keep eye on it
- Stackoverflow
- Embed with customers
- Feedback inside product
- Have to listen in all the channels
- Failure is an opportunity to learn
- Definition of done
- Live in prod, collecting telemetry that tests the hypothesis it was created to prove
- “For those of you who don’t know who Encarta is, look it up on Wikipedia”
- Team Structure
- Combined engineering = devs + testing
- Some testers left team or organisation
- Feature team
- Physical team rooms
- Cross discipline
- 10-12 people
- self managing
- Clear charter and goals
- Intact for 12-18 months
- Sticky note exercise, people pick which teams they would like to join (first 3 choices)
- 20% choose to change
- 100% get the choice
- New constants and requirements
- Problems
- Tests took too long – 22h to 2 days
- Tests failed frequently – Only 60% of runs passed 100%
- Quality signal unreliable in master
- Publish VSTS Quality vision
- Sorted by external dependencies
- Unit tests
- L0 – in-memory unit tests
- L1 – More with SQL
- Functional Tests
- L2 – Functional tests against testable service deployment
- L3 – Restricted class integration tests that run against production
- 83k L0 tests run against all pull requests, very fast
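The idea of sorting tests by their external dependencies can be sketched roughly as below. This is only an illustration of the L0 to L3 split described in the talk, not Microsoft's tooling: the `CaseInfo` record, its `needs_*` flags and the `pull_request_suite` helper are all hypothetical names, and the rule that only L0 runs on every pull request is taken from the note above rather than from any published policy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    L0 = 0  # in-memory unit test, no external dependencies
    L1 = 1  # unit test that also needs SQL
    L2 = 2  # functional test against a testable service deployment
    L3 = 3  # restricted integration test that runs against production


@dataclass(frozen=True)
class CaseInfo:
    """Hypothetical description of one test and what it depends on."""
    name: str
    needs_sql: bool = False
    needs_deployment: bool = False
    needs_production: bool = False


def classify(test: CaseInfo) -> Level:
    """Return the level implied by the heaviest dependency the test has."""
    if test.needs_production:
        return Level.L3
    if test.needs_deployment:
        return Level.L2
    if test.needs_sql:
        return Level.L1
    return Level.L0


def pull_request_suite(tests):
    """Assumption: only L0 tests are cheap enough to run on every pull request."""
    return [t for t in tests if classify(t) == Level.L0]
```

The point of the split is that the cheapest level carries most of the tests, so the quality signal on each pull request comes back quickly and does not depend on anything outside the process.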
- Deploy to rings of users
- Ring 0 – Internal Only
- Ring 1 – Small Datacentre 1-1.5m accounts in Brazil (same TZ as US)
- Ring 2 – Public accounts, medium-large data centre
- Ring 3 – Large internal accounts
- Ring 4-5 – everyone else
- Takes about a week for normal releases.
- Binaries go out and then the database changes
- Delays of minutes (up to 75) during the deploys to allow bugs to manifest
- Some customers have access to feature flags
- Customers who are risk tolerant can opt in to early deploys. Allows them to get faster feedback from people who are able to provide it
- More features delivered in 2016 than previous 4 years. 50% more in 2017
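The ring-by-ring rollout described in these notes can be sketched as a loop that deploys to one ring, waits for bugs to surface, and stops if the ring looks unhealthy. This is a minimal sketch under stated assumptions: `roll_out`, `deploy`, `healthy` and `wait` are hypothetical callables supplied by the caller, the ring names are copied from the notes, and the 75-minute default is simply the upper bound mentioned in the talk, applied here after every ring.

```python
RINGS = [
    "Ring 0: internal only",
    "Ring 1: small data centre",
    "Ring 2: public accounts, medium-large data centre",
    "Ring 3: large internal accounts",
    "Ring 4-5: everyone else",
]


def roll_out(rings, deploy, healthy, wait, bake_minutes=75):
    """Deploy ring by ring; return the rings that completed successfully.

    deploy(ring)  -> pushes the release to that ring (hypothetical hook)
    healthy(ring) -> True if no problems showed up (hypothetical hook)
    wait(minutes) -> pauses so latent bugs have time to manifest
    """
    completed = []
    for ring in rings:
        deploy(ring)
        wait(bake_minutes)
        if not healthy(ring):
            # Halt here: later rings never receive the faulty release.
            return completed
        completed.append(ring)
    return completed


if __name__ == "__main__":
    # Dry run with stand-in hooks; nothing is actually deployed.
    done = roll_out(
        RINGS,
        deploy=lambda ring: print("deploying to", ring),
        healthy=lambda ring: True,
        wait=lambda minutes: None,
    )
    print(len(done), "rings completed")
```

Passing the hooks in as arguments keeps the ordering logic separate from whatever actually performs the deployment and health checks.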