2017 Sysadmin Miniconf – Session 1

The Opposite of the Cloud – Tom Eastman

Koordinates Data Gateway – an appliance on site at customers
Requirements
- A bootable image: OVA, AMI/cloud images
- Needs network access
- Sounds like an IoT device
Opposite of cloud is letting somebody outsource their stuff onto your infrastructure
Tom’s job has been making a nice and tidy appliance
What does IoT get wrong
- Don’t do updates, security patches
- Don’t treat network as hostile
- Hard to remotely admin
How to make them secure
- no default or static credentials
- reduce the attack surface
- secure all network comms
- ensure it fails securely
Solution
- Don’t treat appliances like appliances
- Treat like tightly orchestrated Linux Servers
Stick to conservative architecture
- Use standard distribution like Debian
- You can trust the standard security updates
Solution Components
- aspen: A customized Debian machine image built with Packer
- pando: orchestration server/C&C network
- hakea: A Django/Rest microservice API in charge
SaltStack command and control
- Normal orchestration stuff
- Can work as distributed command execution
- The minions on each server connect to the central node, means you don’t need to connect into a remote appliance (no incoming connections needed to appliance)
- OpenVPN as Internet transport
- Outgoing just port 443 and openvpn protocol. Everything else via OpenVPN
What is the Appliance
- A lightly mangled Debian Jessie VM image
- Easy to maintain by customer, just reboot, activate or reinstall to fix any problems.
- Appliance is running a bunch of docker containers
Appliance authentication
- Needs to connect via 443 with activation code to download VPN and Salt short-lived certificates to get started
- Auth keys only last for 24 hours.
- If I can’t reach it, it kills itself.
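The expiry behaviour above can be sketched roughly as follows. This is a toy model under stated assumptions, not the appliance's actual code: the class names, the `renew` method and the `server_reachable` flag are all hypothetical, and it takes one reading of the notes, namely that an appliance which cannot renew its 24-hour credentials deactivates itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: 24 hours matches the key lifetime mentioned in the notes.
CREDENTIAL_LIFETIME = timedelta(hours=24)


@dataclass
class Credential:
    """A short-lived credential; the field name is illustrative."""
    issued_at: datetime

    def is_valid(self, now: datetime) -> bool:
        return now - self.issued_at < CREDENTIAL_LIFETIME


class Appliance:
    """Toy appliance that fails securely when it cannot renew in time."""

    def __init__(self, credential: Credential):
        self.credential = credential
        self.active = True

    def renew(self, now: datetime, server_reachable: bool) -> None:
        if server_reachable:
            # Hypothetical renewal; a real system would fetch new certificates.
            self.credential = Credential(issued_at=now)
        elif not self.credential.is_valid(now):
            # Expired and unable to renew: stop operating.
            self.active = False


start = datetime(2017, 1, 16, tzinfo=timezone.utc)
appliance = Appliance(Credential(issued_at=start))

appliance.renew(start + timedelta(hours=12), server_reachable=False)
running_after_12h = appliance.active  # credential is still within its lifetime

appliance.renew(start + timedelta(hours=25), server_reachable=False)
running_after_25h = appliance.active  # credential expired with no renewal
```

The point of the design is that losing contact is the failure case that matters: nothing has to reach in to revoke access, because the credentials simply run out.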
Hakea: REST control
- Django REST framework microservices
- Self-documenting using DRF and CoreAPI Schema
DevOps principles apply beyond the cloud

Inventory Management with Pallet Jack – Karl-Johan Karlsson

Goals
- Single source of truth
- Version control
- Scalable (to around 1000 machines, 10k objects)
Stuff stored as just a file structure
Some tools to access
Tools to export, e.g. to Kea DHCP config
Tools as post-commit hooks for git. Pushes out update via salt etc
Various Integrations
- API
- Salt
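
As a rough illustration of the "file structure plus export tools" idea, the sketch below reads one JSON file per machine and emits Kea-style host reservations. The directory layout (`machines/<hostname>.json`) and the `mac` and `ip` keys are assumptions made up for this example and are not Pallet Jack's real on-disk format; only the `hw-address`, `ip-address` and `hostname` keys follow Kea's reservation syntax.

```python
import json
import tempfile
from pathlib import Path


def load_inventory(root: Path) -> list[dict]:
    """Read one JSON file per machine from a hypothetical machines/ directory."""
    machines = []
    for path in sorted((root / "machines").glob("*.json")):
        record = json.loads(path.read_text())
        record["hostname"] = path.stem  # the file name doubles as the hostname
        machines.append(record)
    return machines


def to_kea_reservations(machines: list[dict]) -> list[dict]:
    """Build Kea-style host reservations from the inventory records."""
    return [
        {"hostname": m["hostname"], "hw-address": m["mac"], "ip-address": m["ip"]}
        for m in machines
    ]


# Build a tiny throwaway inventory so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "machines").mkdir()
    (root / "machines" / "web01.json").write_text(
        json.dumps({"mac": "52:54:00:12:34:56", "ip": "192.0.2.10"})
    )
    reservations = to_kea_reservations(load_inventory(root))

print(json.dumps(reservations, indent=2))
```

Because the inventory is just files, an exporter like this could run from a git post-commit hook, which fits the version-control goal listed above.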

Continuous Dashboard – Your DevOps Airbag – Christopher Biggs

Dashboards traditionally targeted at Ops
Also need to target Devs
- KPIs and
Sales and Support need to know everything too
Management want reassurance: when shipping a new feature, you have a hotline to the CEO
Customer, do you have something you are ashamed of?
- Take notice of load spikes
- Assume customer errors are being acted on, option to notify them when a fix happens
- What is relevant to a support call, most recent outages affecting this customer
- Remember recent behaviour of this customer
What kinds of data?
- Traditionally: System load indicators, transaction numbers etc
- Now: Business Goals, unavoidable errors, spikes of errors, location of errors, user experience metrics, health of 3rd party interfaces, App and product reviews
What should I put in dashboards
- Understand the status quo
- Continuously
- Look at trends over time and releases
- Think about features holistically
How to get there
- Like your data as much as your code
- Experiment with your data
- tools: nodered.org, blynk.cc, elastic
Insert Dashboards into your dev pipeline
- Code Review, CI, Unit Test, Confirm that alarms actually work via test errors
- Automate deployment
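
One way to "confirm that alarms actually work via test errors" is to feed deliberate errors into the alarm logic as part of CI. The sketch below is a minimal, made-up example: `ErrorRateAlarm`, its window and its threshold are hypothetical and stand in for whatever alerting rule a real pipeline would exercise.

```python
from collections import deque


class ErrorRateAlarm:
    """Fires when errors among the last `window` events exceed `threshold`."""

    def __init__(self, window: int, threshold: int):
        self.events = deque(maxlen=window)  # oldest events fall off the front
        self.threshold = threshold

    def record(self, is_error: bool) -> None:
        self.events.append(is_error)

    @property
    def firing(self) -> bool:
        return sum(self.events) > self.threshold


def check_alarm_fires_on_injected_errors() -> bool:
    """Return True if the alarm is quiet when healthy and fires on test errors."""
    alarm = ErrorRateAlarm(window=10, threshold=3)
    for _ in range(10):
        alarm.record(False)  # healthy traffic
    quiet_before = not alarm.firing
    for _ in range(5):
        alarm.record(True)  # deliberately injected test errors
    return quiet_before and alarm.firing


alarm_check_passed = check_alarm_fires_on_injected_errors()
```

Running a check like this alongside the unit tests means a broken alarm shows up as a failed build rather than as silence during a real outage.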
Tools
- ELK – off the shelf images, good import/export
- Node-RED – Flow based data processing, nice visual editor, built in dashboarding
- Blynk – Nice dashboards in iOS or Android. Interactive dashboard editor. Easy to share
Social Media integration
- Receive from Twitter, Facebook, app store reviews
- Post to Slack and monitoring channels
- Forward to internal groups

The Sound of Silencing – Julien Goodwin

Humans know to ignore “expected” alerts during maintenance
- Hard to know what is expected vs unexpected
- Major events can lead to alert overload
Level 1 – Turn it all off
- Can work on small scale
Level 2 – Turn off a location while working on it.
- What if something happens while you are doing the work?
- May work with single-service deployments
Level 3 – Turn off the expected alerts
- Hard to get exactly right
Level 4 – Change management integration
- Link the generator up to the change management automation system
- What about changes too small to track?
- What about changes too big for a simple silence?
Level 5 – Inhibiting Alerts
- Use service level indicators to avoid alerts on expected failures
- Fire “goes nowhere” alert
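
A minimal sketch of inhibition, under one interpretation of the notes: a "goes nowhere" alert is never sent to anyone, but while it is firing it suppresses the other alerts for the same site. The `Alert` fields and the `alerts_to_notify` function are hypothetical names for illustration and do not reflect any particular alerting system's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    site: str
    notify: bool = True  # False models an alert that is routed nowhere


def alerts_to_notify(firing: list[Alert]) -> list[Alert]:
    """Drop alerts for any site where a non-notifying inhibitor alert is firing."""
    inhibited_sites = {a.site for a in firing if not a.notify}
    return [a for a in firing if a.notify and a.site not in inhibited_sites]


firing = [
    Alert("SiteInMaintenance", "syd", notify=False),  # the "goes nowhere" alert
    Alert("HostDown", "syd"),
    Alert("HostDown", "mel"),
]
notified = [(a.name, a.site) for a in alerts_to_notify(firing)]
```

Unlike a manual silence, the suppression here lasts exactly as long as the inhibiting alert fires, so nothing needs to be remembered and removed afterwards.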
Level 6 – Global monitoring and preventing over-silencing
- Alert if too many sites down
- Need to have explicit alerts to spot when somebody silences “*”
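
A check for over-broad silences could look something like the sketch below. It assumes silences are glob-style patterns over target names and that covering more than half of all targets counts as too much; both the matching scheme and the 0.5 limit are assumptions chosen for the example.

```python
import fnmatch


def silenced_fraction(pattern: str, targets: list[str]) -> float:
    """Fraction of known targets matched by a glob-style silence pattern."""
    if not targets:
        return 0.0
    matched = [t for t in targets if fnmatch.fnmatchcase(t, pattern)]
    return len(matched) / len(targets)


def is_over_silenced(pattern: str, targets: list[str], limit: float = 0.5) -> bool:
    """True when a silence covers more than `limit` of all targets."""
    return silenced_fraction(pattern, targets) > limit


targets = ["syd-web01", "syd-web02", "mel-web01", "mel-db01"]
narrow = is_over_silenced("syd-web01", targets)  # matches 1 of 4 targets
broad = is_over_silenced("*", targets)           # matches every target
```

The result of such a check would itself need to raise an alert that the silence cannot cover, otherwise a "*" silence would hide its own warning.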
How to get there from here
- Incrementally
- Choose a bad alert and change it to make it better
- Regularly