Martina Iglesias – Automatic Discovery of Service metadata for systems at scale
Backend developer at Spotify
Spotify Scale
– 100m active users
– 800+ tech employees
– 120 teams
– Microservices architecture
Walk through a sample artist’s page
– Each component (playlist, play count, discography) is a separate service
– Results are aggregated and sent back to the client
Hard to coordinate between services as scale grows
– 1000+ services
– Each needs to use the others’ APIs
– Dev teams all around the world
Previous Solution
– Teams had docs in different places
– Some in Wiki, Readme, markdown, all different
Current Solution – System Z
– Centralise in one place, as automated as possible
– Internal application
– Web app: a catalog of all systems and their parts
– Well integrated with Apollo services
Web Page for each service
– Various tabs
– Configuration (showing versions of build and uptimes)
– API – list of all endpoints for the service, schema, error codes, etc. (automatically populated)
– System tab – overview of how the service is connected to other services and its dependencies (generated automatically)
Registration
– System Z gets information from Apollo and prod servers about each service that has been registered
Apollo
– Java libs for writing microservices
– Open source
Apollo-meta
– Metadata module
– Exposes endpoint with metadata for each service
– Exposes:
– instance info – versions, uptime
– configuration – currently loaded config of the service
– endpoints – the endpoints the service serves
– call information – monitors the service and learns which incoming and outgoing calls it actually makes, and to/from which other services
– Automatically builds the dependency graph from this
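The notes describe Apollo-meta’s metadata endpoint only at a high level. As a rough sketch (the field names and values here are illustrative assumptions, not Apollo’s actual schema), a service might assemble that kind of payload like this:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the kind of payload a metadata endpoint
// (like Apollo-meta's) could return; field names are illustrative,
// not Apollo's actual schema.
public class MetaSketch {
    static Map<String, Object> buildMetadata() {
        Map<String, Object> meta = new LinkedHashMap<>();
        meta.put("instance", Map.of(
                "version", "1.4.2",         // build version (example value)
                "uptimeSeconds", 86400L));  // how long the instance has run
        meta.put("config", Map.of(          // currently loaded configuration
                "http.port", 8080));
        meta.put("endpoints", List.of(      // endpoints the service serves
                "GET /v1/playlists/<id>")); // example path, not a real route
        meta.put("calls", Map.of(           // observed calls, learned at runtime
                "incoming", List.of("web-gateway"),
                "outgoing", List.of("playlist-db")));
        return meta;
    }

    public static void main(String[] args) {
        System.out.println(buildMetadata());
    }
}
```

Because the call information is observed rather than declared, the dependency graph in System Z stays current without anyone maintaining it by hand.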
Situation Now
– Quicker access to relevant information
– Automated boring stuff
– All in one place
Learnings
– Think about growth and scaling at the start of the project
Documentation generators
– Apollo
– Swagger.io
– raml.org
Blog: labs.spotify.com
Jobs: spotify.com/jobs
Q: How do you handle breaking API changes?
A: We create a new version of the API endpoint and encourage people to move over.
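The answer above — run the new version alongside the old one and let callers migrate — can be sketched as simple path-prefix routing (a minimal illustration, not Spotify’s actual routing code):

```java
// Minimal sketch (not Spotify's actual code) of keeping both versions of a
// breaking endpoint live under different path prefixes, so existing callers
// keep working while they migrate to the new version.
public class VersionedRoutes {
    static String route(String path) {
        if (path.startsWith("/v1/artist/")) {
            return "legacy payload";   // old response shape, kept for stragglers
        } else if (path.startsWith("/v2/artist/")) {
            return "new payload";      // the breaking change lives here
        }
        return "404";
    }

    public static void main(String[] args) {
        System.out.println(route("/v1/artist/42")); // legacy payload
        System.out.println(route("/v2/artist/42")); // new payload
    }
}
```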
Bridget Cowie – The story of a performance outage, and how we could have prevented it
– Works for Datacom
– Consultant in Application performance management team
Story from Start of 2015
– Friday night phone calls from your boss are never good.
– Dropped in an application monitoring tool (Dynatrace) on Friday night, watched it over the weekend
– Previous team was pretty sure the problem was a memory leak but had not been able to find it (for two weeks)
– If somebody tells you they know what is wrong but can’t give details or fix it, be suspicious
Book: Java Enterprise performance
– Monday prod load goes up and app starts crashing
– Told the ops team, but since the crash wasn’t visible yet, wasn’t believed; had to wait
Tech Stack
– Java App, Jboss on Linux
– Multiple JVMs
– Oracle DBs, Mulesoft ESB, ActiveMQ, HornetQ
Ah Ha moment
– Had a look at import process
– 2.3 million DB queries per half hour
– With a max of 260 users, that seems far more than needed
– Happens even when nobody is logged in
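A quick sanity check on those numbers (arithmetic only, using the figures from the talk) shows why they stood out:

```java
// Arithmetic only, using the figures from the talk: 2.3 million queries per
// half hour works out to over a thousand queries per second, and nearly nine
// thousand queries per user in that window.
public class QueryRate {
    public static void main(String[] args) {
        long queriesPerHalfHour = 2_300_000L;
        int secondsPerHalfHour = 30 * 60;  // 1800 seconds
        int maxUsers = 260;

        long perSecond = queriesPerHalfHour / secondsPerHalfHour; // ~1277
        long perUser = queriesPerHalfHour / maxUsers;             // ~8846

        System.out.println(perSecond + " queries/sec");
        System.out.println(perUser + " queries per user per half hour");
    }
}
```

And since it happened with nobody logged in, the load could not be user-driven at all.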
Tip: Typically 80% of all issues can be detected in dev or test if you look for them.
Where did this code come from?
– Process to import a csv into the database
– 1 call mule -> 12 calls to AMQ -> 12 calls to App -> 102 db queries
– Passes all the tests… but
– still shows huge growth in queries as requests pass through the layers
– DB query counts grow with each run
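The fan-out above (1 → 12 → 12 → 102) is the classic pattern of per-item queries compounding across layers. A minimal sketch (not the actual import code) of how a per-row lookup turns one CSV import into a query count that grows with the data:

```java
import java.util.List;

// Minimal sketch (not the actual import code) of per-item queries compounding:
// one import call fans out to queries per CSV row, so query volume grows with
// data size instead of staying constant.
public class FanOutSketch {
    static int queryCount = 0;

    // Stand-in for a single DB round trip.
    static void dbQuery(String sql) {
        queryCount++;
    }

    // Naive import: one lookup + one insert per CSV row (the N+1 pattern).
    static void importRows(List<String> csvRows) {
        for (String row : csvRows) {
            dbQuery("SELECT id FROM items WHERE key = ?"); // per-row exists check
            dbQuery("INSERT INTO items ...");              // per-row insert
        }
    }

    // Batched alternative: a constant number of queries per import.
    static void importRowsBatched(List<String> csvRows) {
        dbQuery("SELECT key FROM items WHERE key IN (...)");  // one bulk check
        dbQuery("INSERT INTO items ... VALUES (...), (...)"); // one bulk insert
    }

    public static void main(String[] args) {
        List<String> rows = List.of("a", "b", "c", "d", "e");
        importRows(rows);
        System.out.println("naive: " + queryCount + " queries");   // 10
        queryCount = 0;
        importRowsBatched(rows);
        System.out.println("batched: " + queryCount + " queries"); // 2
    }
}
```

Functional tests pass either way — only counting the queries, as the next tip suggests, makes the difference visible.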
Tip: Know how your code behaves and track how this behaviour changes with each code change (or even with no code change)
Q: Why Dynatrace?
A: Quick to deploy, useful info back in only a couple of hours