February 2016 – Page 2 – Simon Lyall's Blog

Linux.conf.au 2016 – Sysadmin Miniconf – Session 2

Site Reliability Engineering at Dropbox – Tammy Butow

Having an SLA and measuring against it. Cap on ops work, blameless post-mortems, always be coding
400M customers, a billion files every day
Very hard to find people to scale, so build tools to scale instead
Team looks after 6,000 DB machines, the whole machine not just the app
Build a lot of tools in Python and Go
PygerDuty – Python library for the PagerDuty API
- Easy to find the top things paging, write tools to reduce these
- Regular weekly meeting to review those problems and make them better
- If work is happening on machines then turn off monitoring on them so people don’t get woken up for things they don’t need to.
- Going for days without any pages
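
A minimal sketch of the "find the top things paging" idea, assuming a week of alerts has already been exported as dicts with a `service` key. The field name and the sample data are hypothetical and are not the PagerDuty or PygerDuty API:

```python
from collections import Counter


def top_pagers(alerts, n=3):
    """Return the n most frequent alert sources as (service, count) pairs."""
    counts = Counter(alert["service"] for alert in alerts)
    return counts.most_common(n)


# Hypothetical sample data standing in for a week of pages.
alerts = [
    {"service": "mysql-replication-lag"},
    {"service": "disk-full"},
    {"service": "mysql-replication-lag"},
    {"service": "mysql-replication-lag"},
    {"service": "disk-full"},
    {"service": "backup-failed"},
]

for service, count in top_pagers(alerts):
    print(f"{count:3d}  {service}")
```

The noisiest entries are the natural agenda for the weekly review meeting.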
Self-Healing and auto-remediation scripts
Hermes
- Allocate and track tasks to groups
Automation of DB tasks
Bot copies PagerDuty alerts into Slack
Aim Higher
- Create a roadmap for next 12 months
- Building a rocketship while it is flying through the sky
Follow the Sun so people are working days
Post Mortem for every page
Frequent DR testing
Take time out to celebrate

I missed out writing up the next couple of talks due to technical problems

Linux.conf.au 2016 – Sysadmin Miniconf – Session 1

Is that a Cloud in your pocket – Steven Ellis

What if you could have a demo of a stack on a phone
or on a memory stick or a mini raspberry-pi type PC
Nested Virtualisation
Hardware
- Using Linux as host env, not so good on Win and Mac
- Thinkpad, fedora or Centos, 128GB SSD
Nested Virtualisation
- Huge performance boost over qemu
- Use SSD
- enable options in modules kvm-intel or kvm-amd
- Confirm SSD perf 1st – hdparm -t /dev/sdX
- Create base env for VMs, enable vmx in features
- Make sure it uses a different network so doesn’t badly interact with ones further out
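
A small sketch for checking whether the nested option is actually on, assuming a Linux host with the KVM module loaded. The module option is typically set with a line such as `options kvm-intel nested=1` in a modprobe.d file; the helper names below are my own:

```python
from pathlib import Path

# sysfs locations of the KVM module parameter on Intel and AMD hosts.
NESTED_PARAMS = [
    Path("/sys/module/kvm_intel/parameters/nested"),
    Path("/sys/module/kvm_amd/parameters/nested"),
]


def parse_nested(value):
    """Interpret the parameter text: 'Y' or '1' means nested virt is on."""
    return value.strip() in ("Y", "y", "1")


def nested_enabled(paths=NESTED_PARAMS):
    """Return True/False for the first KVM module found, None if none is loaded."""
    for path in paths:
        if path.exists():
            return parse_nested(path.read_text())
    return None


if __name__ == "__main__":
    print("nested virtualisation:", nested_enabled())
```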
Think LVM
- Create thin pool for all envs
- Think about setting “issue_discards = 1” in lvm.conf
Base image
- Doesn’t have to be minimal
- update the base regularly
- How do you build your base image?
  - Thin may go weirdly wrong
  - Always use kickstart to re-create it.
- Think of your use case, don’t skimp on the disk (eg 40G disk for image)
- ssh keys, Enable yum cache
- Patch once kicked
- keep a content cache, maybe with rsync or mrepo
Turn off VM and then use fstrim and partx to make it nice and small.
virt-manager can’t manage thin volumes, DON’T manually add the path
use virsh to manually add the path.
Snapshots of snapshots, great performance on SSD
Thin no longer activates automatically on distros
packstack simple way to install simple openstack setup
LVM vs QCOW
- qcow okay for smaller images
- cloud-init with atomic
- do not snapshot a qcow image when it is running

Revisiting Unix principles for modern system automation – Martin Krafft

SSH Botnet
OSI of System Automation
Transport unix style, both push and pull
uses socat for low level data moving
autossh <- restarts ssh connection automatically
creates control socket

A Gentle Introduction to Ceph – Tim Serong

Ceph gives a storage cluster that is self healing and self managed
3 interfaces, object, block, distributed fs
OSD with files on them, monitor nodes
OSD will forward writes to other replicas of the data
clients can read from any OSD
Software defined storage vs legacy appliances
Network
- Fastest you can, separate public and cluster networks
- cluster faster than public
Nodes
- 1-2 GB RAM per TB of storage
- read recommendations
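
The RAM rule of thumb above as quick arithmetic. A sketch only: the 12 x 4 TB node is a hypothetical example, and the official recommendations should win over this estimate:

```python
def osd_node_ram_gb(storage_tb, gb_per_tb_low=1.0, gb_per_tb_high=2.0):
    """Return a (low, high) RAM estimate in GB for a node with storage_tb of disk."""
    return storage_tb * gb_per_tb_low, storage_tb * gb_per_tb_high


# Hypothetical node: 12 drives of 4 TB each.
low, high = osd_node_ram_gb(12 * 4)
print(f"48 TB node: {low:.0f}-{high:.0f} GB RAM")
```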
SSD journals to cache writes
Redundancy
- Replications – capacity impact but usually good performance
- Erasure coding – Like raid – better space efficiency but impact in most other areas
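
A rough sketch of the capacity trade-off between the two schemes. The replica count and the k+m profile are example values I picked, not figures from the talk, and real clusters also need headroom for rebalancing:

```python
def replicated_usable_tb(raw_tb, replicas=3):
    """Every object is stored `replicas` times, so usable space is raw / replicas."""
    return raw_tb / replicas


def erasure_coded_usable_tb(raw_tb, k=4, m=2):
    """Each object is split into k data chunks plus m coding chunks."""
    return raw_tb * k / (k + m)


raw = 120  # TB of raw disk, hypothetical
print("3x replication:", replicated_usable_tb(raw), "TB usable")
print("4+2 erasure coding:", erasure_coded_usable_tb(raw), "TB usable")
```

With these example values erasure coding gives twice the usable space, which is the "better space efficiency" side of the trade.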
Adding more nodes
- tends to work
- temp impact during rebalancing
How to size
- understand your workload
- make a guess
- Build a 10% pilot
- refine until perf is achieved
- scale up the pilot

Keeping Pinterest running – Joe Gordon

Software vs service
- No stable versions
- Only one version is live
- Devs support their own service – aligns incentives, eg monitoring built in
- Testing against production traffic
SRE at Pinterest
- Like a pit crew in F1
- firefighting at scale
- changing tires while moving
Operation Maturity
Operation Excellence
- Have the best practices, docs, process, improvements
- Repeatable deploys
Visibility
- data driven company
- Lots of Time series data – TSDB
- Using ELK
Deployments
- no impact to end user
- easy to do, every few minutes
Canary vs Staging
- Send dark (copies) of traffic to canary box without sending anything back to user
- Bounce back to starting if problems
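
A toy sketch of the dark-traffic idea: the user only ever sees the production response, while a copy of the request goes to the canary and its answer is compared and thrown away. The handler functions are hypothetical, and a real setup would mirror asynchronously rather than inline like this:

```python
def handle_request(request, production, canary, mismatches):
    """Serve from production; mirror the request to the canary and discard its reply."""
    response = production(request)
    try:
        shadow = canary(request)
        if shadow != response:
            mismatches.append((request, response, shadow))
    except Exception as exc:  # a canary failure must never reach the user
        mismatches.append((request, response, exc))
    return response


# Hypothetical handlers standing in for the live build and the canary build.
def production(request):
    return request.upper()


def canary(request):
    if request == "boom":
        raise RuntimeError("canary bug")
    return request.upper()


mismatches = []
print(handle_request("hello", production, canary, mismatches))
print(handle_request("boom", production, canary, mismatches))
print(len(mismatches), "problem(s) seen on canary")
```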
Teletraan
- Rollback, hotfix, rolling deploy, starting and testing, visibility and usability
- client-server model
- pre/post download, restart, etc scripts included with every deployment
- pause/resume various testing
Postmortems and production readiness reviews
Cloud is not infinite, often will hit AWS capacity limits or even nothing available in the region
Need to be able to make sure you know what you are running and if it is efficiently used
Open sourced tools
- mysql_utils – lots of tools to manage many DBs
- Thrift tools
- Teletraan – open sourced in Feb 2016
- github.com/pinterest

Linux.conf.au 2016 – Tuesday – Keynote: George Fong

George Fong – Chair of Internet Australia

The Challenges of the Changing Social Significance of the Nerd

“This is the first conference I’ve been to where there’s an extremely high per capita number of ponytails”
Linux not just running web server and other servers but also network devices
Linux and the Web aren’t the same thing, but they’ve grown symbiotically and neither would be the same without the other
“One of the lessons we’ve learned in Australia is that when you mix technology with politics, you get into trouble”
“We have proof in Australia that if you take guns away from people, people stop getting killed”

Linux.conf.au 2016 – Monday – Session 3

Cloud Anti-Patterns – Casey West

The 5 stages of Cloud Native
Deploying my apps to the cloud is painful – why?
Denial
- “Containers are like tiny VMs”
- Anti-Pattern 1 – do not assume what you have now is what you want to put into the cloud or a container
- “We don’t need to automate continuous delivery”
- We shouldn’t automate what we have until it is perfect. Automate to make things consistent (not always perfect at least at the start)
Anger
- “works on my machine”
- Devs just push straight from dev boxes to production
- Not about making worse code go to production faster
- Aim for repeatable, testable builds, just faster
Bargaining
- “We crammed the monolith into a container and called it a microservice”
- Anti-Pattern: Critically think on what you need to re-factor (or “re-platforming”)
- “Bi-modal IT”
- Some stuff on fast lane, some stuff on old-way slow lane
- Anti-pattern: legacy products put into slow lane, these are often the ones that really need to be fixed.
- “Micro-services” talking to same data-source, not APIs
Depression
- “200 microservices but forgot to setup Jenkins”
- “We have an automated build pipeline but only release twice per year”
Acceptance
- All software sucks, even the stuff we write
- Respect CAP theorem
- Respect Conway’s Law
- Small batch sizes work for replatforming too
Microservices architecture, Devops culture, Continuous delivery – Pick all three

Cloud Crafting – Public / Private / Hybrid – Steven Ellis

What does Hybrid mean to you?
What is private Cloud (IAAS)
Hybrid – communicate to public cloud and manage local stuff
ManageIQ – single pane of glass for hardware, VMs, clouds, containers
What does it do?
- Brownfields as well as Greenfields, gathers current setup
- Discovery, API presentations, control and detect when env non-compliant (eg not fully patched)
- Premise or public cloud
- Supplied as a virtual appliance, HA, scale out
- Platform – CentOS 7, Rails, PostgreSQL, GUI, some dashboards out of the box.
Get involved
- Online, roadmap is public
- Various contributors
DEMO
Just put in credentials to allow access and then it can gather the data straight away

Live Migration of Linux Containers by Tycho Andersen

LXC / LXD
LXD is a REST API that you use to control the container system
tool -> REST -> Daemon -> lxc -> Kernel
“lxc move host1:c1 host2:” – Live migrations
- Needs a bit of work since lots moving, lots of ways it could fail
- 3 channels created, control, filesystem, container processes state
CRIU
- 5 years of check-pointing
- Lots based off OpenVZ initial work
- All sorts of things need to support check-pointing and moving (eg selinux)
- Iterative migration added
- Lots of hooks needed for very privileged kernel features
Filesystems
- btrfs, lvm, zfs, (swift, nfs), have special support for migration that it hooks into
- rsync between incompatible hosts
Memory State
- Stop the world and move it all
- Iterative incremental transfer (via p.haul) being worked on.
LXC + LXD 2.0 should be in Ubuntu 16.04 LTS
Need to use latest versions and latest kernels for best results.

Linux.conf.au 2016 – Monday – Session 1

Open Cloud Miniconf – Continuous Delivery using blue-green deployments and immutable infrastructure by Ruben Rubio Rey

Lots of things can go wrong in a deployment
Often hard to do rollbacks once upgrade happens
Blue-Green deployment is running several envs at the same time, each potentially with different versions
Immutable infrastructure: split between data (which changes) and everything else, which only gets replaced fully by deployments, not changed
When you use docker don’t store data in the container, makes it immutable. But containers are not required to do this.
Rule 1 – Never modify the infrastructure
Rule 2 – Instead of modifying – always create from ground up everything that is not data.
Advantages
- Rollbacks easy
- Avoid Configuration drift
- Updated and accurate infrastructure documentation
Split things up
- No State – LBs, Web servers, App Servers
- Temp data , Volatile State – message queues, email servers
- Persistent data – Databases, Filesystems, slow warming cache
In case of temp data you have to be able to drain
Use LBs and multiple servers to split up infrastructure, more bits give more room to split up the upgrades.
If pending jobs require old/new version of app then route to servers that have/not been upgraded yet.
Put toy rocket launcher in devs office, shoots person who broke the build.
Need to “use activity script” to bleed traffic off section of the “temp data” layer of infrastructure, determine when it is empty and then re-create.
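
A minimal sketch of that drain step, assuming each node can report how many jobs it still holds. `FakeQueueNode` and the `recreate` callback are hypothetical stand-ins, not part of any tool mentioned in the talk:

```python
import time


class FakeQueueNode:
    """Hypothetical stand-in for a message-queue server in the temp-data layer."""

    def __init__(self, version, jobs):
        self.version = version
        self.jobs = list(jobs)
        self.accepting = True

    def pending(self):
        return len(self.jobs)


def drain_and_recreate(node, recreate, wait=time.sleep, poll_seconds=1.0, max_polls=100):
    """Stop new traffic, wait until the node is empty, then rebuild it from scratch."""
    node.accepting = False              # bleed traffic off: no new work arrives
    for _ in range(max_polls):
        if node.pending() == 0:         # the "activity script" check: is it empty yet?
            return recreate(node)
        wait(poll_seconds)
    raise TimeoutError("node did not drain in time")


old = FakeQueueNode("v1", jobs=["a", "b", "c"])
new = drain_and_recreate(
    old,
    recreate=lambda node: FakeQueueNode("v2", jobs=[]),
    wait=lambda _seconds: old.jobs.pop(),  # demo only: pretend one job finished
)
print(old.accepting, old.pending(), new.version)
```

The old node is never modified in place beyond being taken out of rotation, which keeps to Rule 1 and Rule 2 above.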