DevOpsDaysNZ 2018 – Day 2 – Session 1

Alison Polton-Simon – The DevOps Experiments: Reflections From a Scaling Startup

  • Software engineer at Angaza, previously Thoughtworks, “Organizational Anthropologist”
  • Angaza
    • Enable sales of life-changing products (eg solar chargers, water pumps, cook stoves in 3rd world countries)
    • Originally did hardware, changed to software company doing pay-as-you-go of existing devices
    • ~50 people. Kenya and SF, 50% engineering
    • No dedicated Ops
    • Innovate with empathy, Maximise impact
    • Model is to provide software tools to activate/deactivate products sold to peopel with low credit-scores. Plus out software around the activity like reports for distributors.
  • Reliability
    • Platform is business critical
    • Outages disrupt real people (households without light, farmers without irrigation)
    • Buildkite, Grafana, Zendesk
  • Constraints
    • Operate in 30+ countries
    • Low connectivity, 2G networks best case
    • Rural and peri-urban areas
    • Team growing by 50% in 2018 (2 eng teams in Kenya + 1 QA)
    • Most customers in timezone where day = SF night
  • Eras of experimentation
    • Ad Hoc
    • Tributes (sacrifice yourself for the stake of the team)
    • Collectives (multiple teams)
    • Product teams
  • ad Hoc – 5 eng
    • 1 eng team
    • Ops by day – you broke, you fix
    • Ops by night – Pagerduty rotation
    • Paged on all backend exception, 3 pages = amnesty
    • Good
      • Small but senior
      • JIT maturity
      • Everyone sitting next to each other
    • Bad
      • Each incident higherly disruptive
      • prioritized necessity over sustainability
  • Tribute – 5-12 eng
    • One person protecting team from interuptions
    • Introduced support person and triage
    • Expanded PD rotation
    • Good
      • More sustainable
      • Blue-Green deploys
      • Clustered workloads
    • Not
      • Headcount != horizontal scaling
      • Hard to hire
      • Customer service declined
  • Collectives 13-20 engs
    • Support and Ops teams – Ops staffed with devs
    • Other teams build roadmaps and requests
    • Teams rotate quarterly – helps onboarding
    • Good
      • Cross train ppl
      • Allow for focus
      • allowed ppl to get depth
    • Bad
      • Teams don’t op what they built
      • Quarter flies by quickly
      • Context switch is costly
      • Still a juggling act
      • 1m ramping up, 6w going okay, 2w winging down
  • Product teams  21 -? eng
    • 5 teams, 2 in Nairobi
    • Teams allighned with business virticals, KPIs
    • Dev, own and maintain services
    • Per-team tributes
    • No [Dev]Ops team
    • Intended goals
      • Independent teams
      • own what build
      • Support biz KPIs
      • cross team coordination
    • Expected Chellenges
      • Ownership without responsbility
      • Global knowledge sharing
      • Return to tribute system (2w out of the workflow)
  • Next
    • Keep growing team
    • Working groups
    • Eventual SRE
    • 24h global coverage
  • Case a “Constitution” of values that everybody who is hired signs
  • Takeaways
    • Maximise impact
      • Dependable tools over fashionable ones
      • Prefer industry-std tech
      • But get creative when necessary
    • Define what reliability means for your system
    • Evolve with Empathy
      • Don’t be dogmatic without structure
      • Serve your customers and your team
      • Adapt when necessary
      • Talk to people
      • Be explicit as to the tradeoffs you are making

Anthony Borton – Four lessons learnt from Microsoft’s DevOps Transformation

  • Microsoft starting in 1975
  • 93k odd engineers at Microsoft
    • 78k deployments per day
    • 2m commits per month
    • 4.4 builds/month
    • 500 million tests/day
  • 2018 State of Devops reports looks at Elite performers in the space
  • TFS – Team Foundation Server
    • Move product to the cloud
    • Moved on-prem to one instance
    • Each account had it’s own DB (broke stuff at 11k DBs)
  • 4 lessons
    • Customer focussed
      • Listen to customers, uservoice.com
      • Lots of team members keep eye on it
      • Stackoverflow
      • Embed with customers
      • Feedback inside product
      • Have to listen in all the channels
    • Failure is an opportunity to learn
    • Definition of done
      • Live in prod, collecting telemetary that examines hypotheses that it was created to prove
    • “For those of you who don’t know who Encarta is, look it up on Wikipedia”
  • Team Structure
    • Combined engineering = devs + testing
      • Some testers left team or organisation
    • Feature team
      • Physical team rooms
      • Cross discipline
      • 10-12 people
      • self managing
      • Clear charter and goals
      • Intact for 12-18 months
    • Sticky note exercise, people pick which teams they would like to join (first 3 choices)
      • 20% choose to change
      • 100% get the choice
  • New constants and requirements
    • Problems
      • Tests took too long – 22h to 2days
      • Tests failed frequently – On 60% passed 100%
      • Quality signal unreliable in master
    • Publish VSTS Quality vision
      • Sorted by exteranl dependancies
      • Unit tests
        • L0 – in-memory unit tests
        • L1 – More with SQL
      • Functional Tests
        • L2 – Functional tests against testable service deployment
        • L3 – Restricted class integration tests that run against production
      • 83k L0 tests run agains all pulls very fast
  • Deploy to rings of users
    • Ring 0 – Internal Only
    • Ring 1 – Small Datacentre 1-1.5m accounts in Brazil (same TZ as US)
    • Ring 2 – Public accounts, medium-large data centre
    • Ring 3 – Large internal accounts
    • Ring 4-5 – everyone else
    • Takes about a week for normal releases.
    • Binaries go out and then the database changes
    • Delays of minutes (up to 75) during the deploys to allow bugs to manafest
    • Some customers have access to feature flags
    • Customers who are risk tolerant can opt in to early deploys. Allows them to get faster feedback from people who are able to provide it
  • More features delivered in 2016 than previous 4 years. 50% more in 2017

 

Share