DevOpsDaysNZ 2018 – Day 2 – Session 1

Alison Polton-Simon – The DevOps Experiments: Reflections From a Scaling Startup

Software engineer at Angaza, previously Thoughtworks, “Organizational Anthropologist”
Angaza
- Enable sales of life-changing products (e.g. solar chargers, water pumps, cook stoves in 3rd world countries)
- Originally did hardware, changed to software company doing pay-as-you-go of existing devices
- ~50 people. Kenya and SF, 50% engineering
- No dedicated Ops
- Innovate with empathy, Maximise impact
- Model is to provide software tools to activate/deactivate products sold to people with low credit scores, plus other software around that activity, such as reports for distributors.
Reliability
- Platform is business critical
- Outages disrupt real people (households without light, farmers without irrigation)
- Buildkite, Grafana, Zendesk
Constraints
- Operate in 30+ countries
- Low connectivity, 2G networks best case
- Rural and peri-urban areas
- Team growing by 50% in 2018 (2 eng teams in Kenya + 1 QA)
- Most customers in timezone where day = SF night
Eras of experimentation
- Ad Hoc
- Tributes (sacrifice yourself for the sake of the team)
- Collectives (multiple teams)
- Product teams
Ad Hoc – 5 eng
- 1 eng team
- Ops by day – you broke, you fix
- Ops by night – Pagerduty rotation
- Paged on all backend exceptions, 3 pages = amnesty
- Good
  - Small but senior
  - JIT maturity
  - Everyone sitting next to each other
- Bad
  - Each incident highly disruptive
  - Prioritized necessity over sustainability
Tribute – 5-12 eng
- One person protecting team from interruptions
- Introduced support person and triage
- Expanded PD rotation
- Good
  - More sustainable
  - Blue-Green deploys
  - Clustered workloads
- Bad
  - Headcount != horizontal scaling
  - Hard to hire
  - Customer service declined
Collectives – 13-20 eng
- Support and Ops teams – Ops staffed with devs
- Other teams build roadmaps and requests
- Teams rotate quarterly – helps onboarding
- Good
  - Cross train ppl
  - Allow for focus
  - Allowed ppl to get depth
- Bad
  - Teams don’t operate what they built
  - Quarter flies by quickly
  - Context switch is costly
  - Still a juggling act
  - 1m ramping up, 6w going okay, 2w winding down
Product teams – 21-? eng
- 5 teams, 2 in Nairobi
- Teams aligned with business verticals, KPIs
- Dev, own and maintain services
- Per-team tributes
- No [Dev]Ops team
- Intended goals
  - Independent teams
  - Own what they build
  - Support biz KPIs
  - Cross-team coordination
- Expected challenges
  - Ownership without responsibility
  - Global knowledge sharing
  - Return to tribute system (2w out of the workflow)
Next
- Keep growing team
- Working groups
- Eventual SRE
- 24h global coverage
Created a “Constitution” of values that everybody who is hired signs
Takeaways
- Maximise impact
  - Dependable tools over fashionable ones
  - Prefer industry-std tech
  - But get creative when necessary
- Define what reliability means for your system
- Evolve with Empathy
  - Don’t be dogmatic about structure
  - Serve your customers and your team
  - Adapt when necessary
  - Talk to people
  - Be explicit as to the tradeoffs you are making

Anthony Borton – Four lessons learnt from Microsoft’s DevOps Transformation

Microsoft started in 1975
93k odd engineers at Microsoft
- 78k deployments per day
- 2m commits per month
- 4.4m builds/month
- 500 million tests/day
The 2018 State of DevOps report looks at elite performers in the space
TFS – Team Foundation Server
- Move product to the cloud
- Moved on-prem to one instance
- Each account had its own DB (broke stuff at 11k DBs)
4 lessons
- Customer focussed
  - Listen to customers, uservoice.com
  - Lots of team members keep eye on it
  - Stack Overflow
  - Embed with customers
  - Feedback inside product
  - Have to listen in all the channels
- Failure is an opportunity to learn
- Definition of done
  - Live in prod, collecting telemetry that tests the hypothesis it was created to prove
- “For those of you who don’t know who Encarta is, look it up on Wikipedia”
Team Structure
- Combined engineering = devs + testing
  - Some testers left team or organisation
- Feature team
  - Physical team rooms
  - Cross discipline
  - 10-12 people
  - self managing
  - Clear charter and goals
  - Intact for 12-18 months
- Sticky note exercise, people pick which teams they would like to join (first 3 choices)
  - 20% choose to change
  - 100% get the choice
New constraints and requirements
- Problems
  - Tests took too long – 22h to 2 days
  - Tests failed frequently – Only 60% of runs passed 100%
  - Quality signal unreliable in master
- Publish VSTS Quality vision
  - Sorted by external dependencies
  - Unit tests
    - L0 – in-memory unit tests
    - L1 – More with SQL
  - Functional Tests
    - L2 – Functional tests against testable service deployment
    - L3 – Restricted class integration tests that run against production
  - 83k L0 tests run against all pull requests, very fast
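
The L0 to L3 levels above sort tests by what they depend on externally. A minimal sketch of that idea in Python follows. The `CaseInfo` fields, the `classify` function, and the test names are hypothetical, chosen for illustration, and are not taken from VSTS itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseInfo:
    """Hypothetical description of what a test needs in order to run."""
    name: str
    needs_sql: bool = False           # touches a SQL database
    needs_deployment: bool = False    # needs a testable service deployment
    runs_in_production: bool = False  # runs against production


def classify(case: CaseInfo) -> str:
    """Return the level for a test, checking the heaviest dependency first."""
    if case.runs_in_production:
        return "L3"
    if case.needs_deployment:
        return "L2"
    if case.needs_sql:
        return "L1"
    return "L0"


def pull_request_suite(cases):
    """Assumption: only L0 tests gate every pull request, as in the notes."""
    return [c.name for c in cases if classify(c) == "L0"]


cases = [
    CaseInfo("parses_work_item"),
    CaseInfo("saves_work_item", needs_sql=True),
    CaseInfo("creates_build_via_api", needs_deployment=True),
    CaseInfo("smoke_check_login", runs_in_production=True),
]
levels = {c.name: classify(c) for c in cases}
```

Checking the heaviest dependency first means a test that needs both SQL and a deployment lands in the slower level, which keeps the fast levels fast.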
Deploy to rings of users
- Ring 0 – Internal Only
- Ring 1 – Small Datacentre 1-1.5m accounts in Brazil (same TZ as US)
- Ring 2 – Public accounts, medium-large data centre
- Ring 3 – Large internal accounts
- Ring 4-5 – everyone else
- Takes about a week for normal releases.
- Binaries go out and then the database changes
- Delays of minutes (up to 75) during the deploys to allow bugs to manifest
- Some customers have access to feature flags
- Customers who are risk tolerant can opt in to early deploys. Allows them to get faster feedback from people who are able to provide it
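
The ring rollout described above can be sketched as a loop that deploys to one ring, waits for a bake period, and stops if that ring looks unhealthy. This is an illustration only: `roll_out`, `deploy`, `is_healthy`, and the injectable `sleep` are hypothetical names, and using a single bake time for every ring is an assumption (the notes only say the delays run up to 75 minutes).

```python
import time
from typing import Callable, List

# Ring names follow the list above; the identifiers themselves are made up.
RINGS = ["ring0", "ring1", "ring2", "ring3", "ring4-5"]


def roll_out(
    rings: List[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    bake_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Deploy ring by ring and return the rings that completed successfully.

    Stops at the first ring that is unhealthy after its bake period, so
    later rings never receive the release.
    """
    completed = []
    for ring in rings:
        deploy(ring)
        sleep(bake_seconds)  # give bugs time to manifest before moving on
        if not is_healthy(ring):
            break
        completed.append(ring)
    return completed


# Example run: ring2 is reported unhealthy, so ring3 and ring4-5 are skipped.
deployed = []
done = roll_out(
    RINGS,
    deploy=deployed.append,
    is_healthy=lambda ring: ring != "ring2",
    bake_seconds=0,
    sleep=lambda seconds: None,  # skip real waiting in the example
)
```

Passing `sleep` in as a parameter keeps the sketch testable without waiting out a real bake period.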
More features delivered in 2016 than in the previous 4 years. 50% more in 2017