2017 Sysadmin Miniconf – Session 1

The Opposite of the Cloud – Tom Eastman

  • Korinates Data gateway – an appliance onsite at customers
  • Requirements
    • A bootable images ova, AMI/cloud images
    • Needs network access
    • Sounds like an IoT device
  • Opoossite of cloud is letting somebody outsource their stuff onto your infrastructure
  • Tom’s job has been making a nice and tidy appliance
  • What does IoT get wrong
    • Don’t do updates, security patches
    • Don’t treat network as hostile
    • Hard to remotely admin
  • How to make them secure
    • no default or static credentials
    • reduce the attack surface
    • secure all networks comms
    • ensure it fails securely
  • Solution
    • Don’t treat appliances like appliances
    • Treat like tightly orchestrated Linux Servers
  • Stick to conserative archetecture
    • Use standard distribution like Debian
    • You can trust the standard security updates
  • Solution Components
    • aspen: A customized Debian machine image built with Packer
    • pando: orchestration server/C&C network
    • hakea: A Django/Rest microservice API in charge
  • saltstack command and control
    • Normal orchestration stuff
    • Can works as a distributed command execution
    • The minions on each server connect to the central node, means you don’t need to connect into a remote appliance (no incoming connections needed to appliance)
    • OpenVPN as Internet transport
    • Outgoing just port 443 and openvpn protocol. Everything else via OpenVPN
  • What is the Appliance
    • A lightly mangled Debian Jessie VM image
    • Easy to maintain by customer, just reboot, activate or reinstall to fix any problems.
    • Appliance is running a bunch of docker containers
  • Appliance authentication
    • Needs to connect via 443 with activation code to download VPN and Salt short-lived certificates to get started
    • Auth keys only last for 24 hours.
    • If I can’t reach it it kills itself.
  • Hakea: REST control
    • Django REST framework microservices
    • Self documenting using DRF amd CoreAPI Schema
  • DevOps Principals apply beyonf the cloud

Inventory Management with Pallet Jack – Karl-Johan Karlsson

  • Goals
    • Single source of truth
    • Version control
    • Scaleable (to around 1000 machines, 10k objects)
  • Stuff stored as just a file structure
  • Some tools to access
  • Tools to export, eg to kea DHCP config
  • Tools as post-commit hooks for git. Pushes out update via salt etc
  • Various Integrations
    • API
    • Salt

Continuous Dashboard – You DevOps Airbag – Christopher Biggs

  • Dashboard traditionally targeted at OPs
  • Also need to target Devs
    • KPIs and
  • Sales and Support need to know everything to
  • Management want reassurance, Shipping a new feature, you have a hotline to the CEO
  • Customer, do you have something you are ashamed of?
    • Take notice of load spikes
    • Assume customers errors are being acted on, option to notify then when a fix happens
    • What is relivant to support call, most recent outages affecting this customer
    • Remember recent behavour of this customer
  • What kinds of data?
    • Tradditionally: System load indicators, transtion numbers etc
    • Now: Business Goals, unavoidable errors, spikes of errors, location of errors, user experience metrics, health of 3rd party interfaces, App and product reviews
  • What should I put in dashboards
    • Understand the Status-quo
    • Continuously
    • Look at trends over time and releases
    • Think about features holisticly
  • How to get there
    • Like you data as much as your code
    • Experiment with your data
    • tools: nodered.org, blynk.cc, elastic
  • Insert Dashboards into your dev pipeline
    • Code Review, CI, Unit Test, Confirm that alarms actually work via test errors
    • Automate deployment
  • Tools
    • ELK – off the shelf images, good import/export
    • Node-RED – Flow based data processing, nice visual editor, built in dashboarding
    • Blynk – Nice dashboards in Ios or Android. Interactive dashboard editor. Easy to share
  • Social Media integration
    • Receive from twitter, facebook, apps stores reviews
    • Post to slack and monitoring channels
    • Forward to internal groups

The Sound of Silencing – Julien Goodwin

  • Humans know to ignore “expected” alerts during maintenance
    • Hard to know what is expected vs unexpected
    • Major events can lead to alert overload
  • Level 1 – Turn it all off
    • Can work on small scale
  • Level 2 – Turn off a localtion while working on it.
    • What if something happens while you are doing the work?
    • May work with single-service deployments
  • Level 3 – Turn off the expect alerts
    • Hard to get exactly right
  • Level 4 – Change mngt integration
    • Link the generator up to th change mngt automation system
    • What about changes too small to track?
    • What about changes too big for a simple silence?
  • Level 5 – Inhibiting Alerts
    • Use Service level indigations to avoid alerts on expected failures
    • Fire “goes nowhere” alert
  • Level 6 – Global monitoring and preventing over-siliencing
    • Alert if too many sites down
    • Need to have explicit alerts to spot when somebody silences “*”
  • How to get there from here
    • Incrementally
    • Choose a bad alert and change it to make it better
    • Regularly