Martina Iglesias – Automatic Discovery of Service metadata for systems at scale
Backend developer at Spotify
Spotify Scale
– 100m active users
– 800+ tech employees
– 120 teams
– Microservices architecture
Walk through a sample artist’s page
– Each component (playlist, play count, discography) is a separate service
– Results are aggregated and sent back to the client
Hard to coordinate between services as scale grows
– 1000+ services
– Each needs to use the others’ APIs
– Dev teams all around the world
Previous Solution
– Teams had docs in different places
– Some in Wiki, Readme, markdown, all different
Current Solution – System Z
– Centralise in one place, as automated as possible
– Internal application
– Web app: a catalog of all systems and their parts
– Well integrated with Apollo services
Web Page for each service
– Various tabs
– Configuration (showing versions of build and uptimes)
– API – list of all endpoints for the service, schema, error codes, etc. (automatically populated)
– System tab – overview of how the service is connected to other services and its dependencies (generated automatically)
Registration
– System Z gets information from Apollo and prod servers about each service that has been registered
Apollo
– Java libs for writing microservices
– Open source
Apollo-meta
– Metadata module
– Exposes endpoint with metadata for each service
– Exposes:
– instance info – versions, uptime
– configuration – currently loaded config of the service
– endpoints – the endpoints the service serves
– call information – monitors the service and learns which incoming and outgoing calls it actually makes, and to/from which other services
– Automatically builds the dependency graph from this
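The notes describe Apollo-meta’s metadata endpoint only at a high level. As a rough sketch (the field names and values here are illustrative assumptions, not Apollo’s actual schema), a service might assemble that kind of payload like this:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the kind of payload a metadata endpoint
// (like Apollo-meta's) could return; field names are illustrative,
// not Apollo's actual schema.
public class MetaSketch {
    static Map<String, Object> buildMetadata() {
        Map<String, Object> meta = new LinkedHashMap<>();
        meta.put("instance", Map.of(
                "version", "1.4.2",         // build version (example value)
                "uptimeSeconds", 86400L));  // how long the instance has run
        meta.put("config", Map.of(          // currently loaded configuration
                "http.port", 8080));
        meta.put("endpoints", List.of(      // endpoints the service serves
                "GET /v1/playlists/<id>")); // example path, not a real route
        meta.put("calls", Map.of(           // observed calls, learned at runtime
                "incoming", List.of("web-gateway"),
                "outgoing", List.of("playlist-db")));
        return meta;
    }

    public static void main(String[] args) {
        System.out.println(buildMetadata());
    }
}
```

Because the call information is observed rather than declared, the dependency graph in System Z stays current without anyone maintaining it by hand.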
Situation Now
– Quicker access to relevant information
– Automated boring stuff
– All in one place
Learnings
– Think about growth and scaling at the start of the project
Documentation generators
– Apollo
– Swagger.io
– raml.org
Blog: labs.spotify.com
Jobs: spotify.com/jobs
Q: How do you handle breaking API changes?
A: We create a new version of the API endpoint and encourage people to move over.
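The answer above — run the new version alongside the old one and let callers migrate — can be sketched as simple path-prefix routing (a minimal illustration, not Spotify’s actual routing code):

```java
// Minimal sketch (not Spotify's actual code) of keeping both versions of a
// breaking endpoint live under different path prefixes, so existing callers
// keep working while they migrate to the new version.
public class VersionedRoutes {
    static String route(String path) {
        if (path.startsWith("/v1/artist/")) {
            return "legacy payload";   // old response shape, kept for stragglers
        } else if (path.startsWith("/v2/artist/")) {
            return "new payload";      // the breaking change lives here
        }
        return "404";
    }

    public static void main(String[] args) {
        System.out.println(route("/v1/artist/42")); // legacy payload
        System.out.println(route("/v2/artist/42")); // new payload
    }
}
```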
Bridget Cowie – The story of a performance outage, and how we could have prevented it
– Works for Datacom
– Consultant in Application performance management team
Story from Start of 2015
– Friday night phone calls from your boss are never good.
– Dropped in an application monitoring tool (Dynatrace) on Friday night, watched it over the weekend
– Previous team was pretty sure the problem was a memory leak but had not been able to find it (for two weeks)
– If somebody tells you they know what is wrong but can’t give details or fix it, be suspicious
Book: Java Enterprise performance
– Monday prod load goes up and app starts crashing
– Told the ops team, but since the crash wasn’t visible yet, wasn’t believed; had to wait
Tech Stack
– Java App, Jboss on Linux
– Multiple JVMs
– Oracle DBs, Mulesoft ESB, ActiveMQ, HornetQ
Ah Ha moment
– Had a look at import process
– 2.3 million DB queries per half hour
– With a max of 260 users, that seems far more than needed
– Happens even when nobody is logged in
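A quick sanity check on those numbers (arithmetic only, using the figures from the talk) shows why they stood out:

```java
// Arithmetic only, using the figures from the talk: 2.3 million queries per
// half hour works out to over a thousand queries per second, and nearly nine
// thousand queries per user in that window.
public class QueryRate {
    public static void main(String[] args) {
        long queriesPerHalfHour = 2_300_000L;
        int secondsPerHalfHour = 30 * 60;  // 1800 seconds
        int maxUsers = 260;

        long perSecond = queriesPerHalfHour / secondsPerHalfHour; // ~1277
        long perUser = queriesPerHalfHour / maxUsers;             // ~8846

        System.out.println(perSecond + " queries/sec");
        System.out.println(perUser + " queries per user per half hour");
    }
}
```

And since it happened with nobody logged in, the load could not be user-driven at all.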
Tip: Typically 80% of all issues can be detected in dev or test if you look for them.
Where did this code come from?
– Process to import a csv into the database
– 1 call mule -> 12 calls to AMQ -> 12 calls to App -> 102 db queries
– Passes all the tests… but
– still shows huge growth in queries as requests pass through the layers
– DB query counts grow with each run
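The fan-out above (1 → 12 → 12 → 102) is the classic pattern of per-item queries compounding across layers. A minimal sketch (not the actual import code) of how a per-row lookup turns one CSV import into a query count that grows with the data:

```java
import java.util.List;

// Minimal sketch (not the actual import code) of per-item queries compounding:
// one import call fans out to queries per CSV row, so query volume grows with
// data size instead of staying constant.
public class FanOutSketch {
    static int queryCount = 0;

    // Stand-in for a single DB round trip.
    static void dbQuery(String sql) {
        queryCount++;
    }

    // Naive import: one lookup + one insert per CSV row (the N+1 pattern).
    static void importRows(List<String> csvRows) {
        for (String row : csvRows) {
            dbQuery("SELECT id FROM items WHERE key = ?"); // per-row exists check
            dbQuery("INSERT INTO items ...");              // per-row insert
        }
    }

    // Batched alternative: a constant number of queries per import.
    static void importRowsBatched(List<String> csvRows) {
        dbQuery("SELECT key FROM items WHERE key IN (...)");  // one bulk check
        dbQuery("INSERT INTO items ... VALUES (...), (...)"); // one bulk insert
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a", "b", "c", "d", "e");
        importRows(rows);
        System.out.println("naive: " + queryCount + " queries");   // 10
        queryCount = 0;
        importRowsBatched(rows);
        System.out.println("batched: " + queryCount + " queries"); // 2
    }
}
```

Functional tests pass either way — only counting the queries, as the next tip suggests, makes the difference visible.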
Tip: Know how your code behaves and track how this behaviour changes with each code change (or even with no code change)
Q: Why Dynatrace?
A: Quick to deploy, useful info back in only a couple of hours