Linux.conf.au 2016 – Sysadmin Miniconf – Session 1

Is that a Cloud in your pocket – Steven Ellis

What if you could have a demo of a stack on a phone
or on a memory stick or a mini Raspberry Pi-type PC
Nested Virtualisation
Hardware
- Using Linux as host env, not so good on Win and Mac
- ThinkPad, Fedora or CentOS, 128GB SSD
Nested Virtualisation
- Huge performance boost over qemu
- Use SSD
- enable the nested option in the kvm-intel or kvm-amd module
- Confirm SSD perf 1st – hdparm -t /dev/sdX
- Create base env for VMs, enable vmx in features
- Make sure it uses a different network so it doesn’t interact badly with the ones further out
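A rough sketch of checking the module option mentioned above: on Linux the KVM modules report their settings under /sys/module, and the nested parameter reads Y or 1 when enabled. The function names are made up for illustration, and the paths assume a stock kernel with kvm_intel or kvm_amd loaded.

```python
from pathlib import Path

# Sysfs files where the KVM modules report whether nesting is switched on.
NESTED_PARAMS = [
    Path("/sys/module/kvm_intel/parameters/nested"),
    Path("/sys/module/kvm_amd/parameters/nested"),
]


def parse_nested_flag(text):
    """Return True if a sysfs 'nested' value means enabled (Y or 1)."""
    return text.strip().upper() in {"Y", "1"}


def nested_virt_enabled(paths=NESTED_PARAMS):
    """Return True/False for the first module found, None if neither is loaded."""
    for path in paths:
        if path.exists():
            return parse_nested_flag(path.read_text())
    return None


if __name__ == "__main__":
    state = nested_virt_enabled()
    if state is None:
        print("no kvm_intel/kvm_amd module loaded")
    elif state:
        print("nested virtualisation enabled")
    else:
        # A line such as 'options kvm-intel nested=1' in /etc/modprobe.d/
        # is the usual way to turn it on at module load time.
        print("nested virtualisation disabled")
```

If this reports disabled, the option has to be set and the module reloaded before guests can see vmx.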
Think LVM
- Create thin pool for all envs
- Think about setting “issue_discards = 1” in lvm.conf
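A minimal sketch of what the thin pool setup might look like, written as a helper that only builds the lvcreate command lines (it does not run them). The volume group, pool and volume names and the sizes are placeholders, not values from the talk.

```python
def thin_pool_cmd(vg, pool, size):
    """Command to create a thin pool of the given size in volume group vg."""
    return ["lvcreate", "--type", "thin-pool", "-L", size, "-n", pool, vg]


def thin_volume_cmd(vg, pool, name, virtual_size):
    """Command to create a thin volume backed by an existing pool.

    virtual_size may exceed the free space in the pool; blocks are only
    allocated as the guest writes to them.
    """
    return ["lvcreate", "--type", "thin", "-V", virtual_size,
            "--thinpool", pool, "-n", name, vg]


if __name__ == "__main__":
    # Placeholder names: vg0 / pool0 / base are assumptions for the example.
    for cmd in (thin_pool_cmd("vg0", "pool0", "100G"),
                thin_volume_cmd("vg0", "pool0", "base", "40G")):
        print(" ".join(cmd))
```

The issue_discards setting lives in the devices section of lvm.conf, so it is configured separately from these commands.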
Base image
- Doesn’t have to be minimal
- update the base regularly
- How do you build your base image?
  - Thin may go weirdly wrong
  - Always use kickstart to re-create it.
- Think of your use case, don’t skimp on the disk (eg 40G disk for image)
- ssh keys, Enable yum cache
- Patch once kicked
- keep a content cache, maybe with rsync or mrepo
Turn off the VM and then use fstrim and partx to make it nice and small.
virt-manager can’t manage thin volumes, so DON’T add the path manually there;
use virsh to add the path manually.
Snapshots of snapshots get great performance on SSD
Thin no longer activates automatically on distros
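The activation note probably relates to LVM's activation-skip flag, which is set on thin snapshots when they are created, so a plain activation leaves them inactive. A small sketch of the two commands involved, again only building the command lines; the names are placeholders.

```python
def thin_snapshot_cmd(vg, origin, snap):
    """Command to snapshot a thin volume (no size needed for thin snapshots)."""
    return ["lvcreate", "-s", "-n", snap, f"{vg}/{origin}"]


def activate_snapshot_cmd(vg, snap):
    """Command to activate a snapshot, ignoring its activation-skip flag (-K)."""
    return ["lvchange", "-ay", "-K", f"{vg}/{snap}"]


if __name__ == "__main__":
    # vg0 / base / demo1 are assumed names for illustration only.
    print(" ".join(thin_snapshot_cmd("vg0", "base", "demo1")))
    print(" ".join(activate_snapshot_cmd("vg0", "demo1")))
```

Because a thin snapshot is itself a thin volume, the same pair of commands works for a snapshot of a snapshot.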
Packstack is a simple way to install a simple OpenStack setup
LVM vs QCOW
- qcow okay for smaller images
- cloud-init with atomic
- do not snapshot a qcow image when it is running

Revisiting Unix principles for modern system automation – Martin Krafft

A Gentle Introduction to Ceph – Tim Serong

Ceph gives a storage cluster that is self healing and self managed
3 interfaces, object, block, distributed fs
OSD with files on them, monitor nodes
OSD will forward writes to other replicas of the data
clients can read from any OSD
Software defined storage vs legacy appliances
Network
- Fastest you can, separate public and cluster networks
- cluster faster than public
Nodes
- 1-2GB RAM per TB of storage
- read the recommendations
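The RAM rule of thumb above as quick arithmetic. This is only a sketch of the 1-2GB per TB guideline from the talk; the function name is made up, and real sizing should follow the current Ceph hardware recommendations.

```python
def osd_node_ram_gb(storage_tb, gb_per_tb_low=1.0, gb_per_tb_high=2.0):
    """Return the (low, high) RAM estimate in GB for a node holding storage_tb."""
    return (storage_tb * gb_per_tb_low, storage_tb * gb_per_tb_high)


if __name__ == "__main__":
    # Example node: 12 disks of 4TB each (an assumed configuration).
    low, high = osd_node_ram_gb(12 * 4)
    print(f"plan for roughly {low:.0f}-{high:.0f} GB of RAM")
```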
SSD journals to cache writes
Redundancy
- Replications – capacity impact but usually good performance
- Erasure coding – Like RAID – better space efficiency but impact in most other areas
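To put numbers on the capacity trade-off, here is a small sketch comparing usable space under replication and erasure coding. It assumes the standard ratios (1/replicas for replication, k/(k+m) for an erasure-coded pool with k data and m coding chunks) and ignores filesystem overhead and the headroom a real cluster needs.

```python
def usable_replicated(raw_tb, replicas=3):
    """Usable capacity when every object is stored `replicas` times."""
    return raw_tb / replicas


def usable_erasure_coded(raw_tb, k, m):
    """Usable capacity with k data chunks and m coding chunks per object."""
    return raw_tb * k / (k + m)


if __name__ == "__main__":
    raw = 300  # TB of raw disk, an arbitrary example figure
    print("3x replication:", usable_replicated(raw, 3), "TB usable")
    print("EC 4+2:", usable_erasure_coded(raw, 4, 2), "TB usable")
```

With these example figures, 3x replication leaves a third of the raw space usable and a 4+2 profile leaves two thirds, while both tolerate the loss of two copies or chunks.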
Adding more nodes
- tends to work
- temp impact during rebalancing
How to size
- understand your workload
- make a guess
- Build a 10% pilot
- refine until perf is achieved
- scale up the pilot

Keeping Pinterest running – Joe Gordon

Software vs service
- No stable versions
- Only one version is live
- Devs support their own service – aligns incentives, eg monitoring built in
- Testing against production traffic
SRE at Pinterest
- Like a pit crew in F1
- firefighting at scale
- changing tires while moving
Operation Maturity
Operation Excellence
- Have the best practices, docs, process, improvements
- Repeatable deploys
Visibility
- data driven company
- Lots of Time series data – TSDB
- Using ELK
Deployments
- no impact to end user
- easy to do, every few minutes
Canary vs Staging
- Send dark (copies) of traffic to canary box without sending anything back to user
- Bounce back to starting if problems
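A minimal sketch of the dark traffic idea: the user only ever gets the production response, while a copy of the request goes to the canary and whatever the canary does is recorded and thrown away. The handler names are hypothetical and this is not Pinterest's implementation; a real setup would send the copy asynchronously so the canary cannot slow users down.

```python
import copy


def handle_with_dark_copy(request, production, canary, on_canary_error=None):
    """Serve from production and mirror a copy of the request to the canary."""
    response = production(request)
    try:
        # The canary gets its own copy and its response is discarded.
        canary(copy.deepcopy(request))
    except Exception as exc:
        # Canary failures must never reach the user; just record them.
        if on_canary_error is not None:
            on_canary_error(request, exc)
    return response


if __name__ == "__main__":
    errors = []

    def production(req):
        return "prod:" + req["path"]

    def broken_canary(req):
        raise RuntimeError("new build crashed on " + req["path"])

    reply = handle_with_dark_copy({"path": "/home"}, production, broken_canary,
                                  lambda req, exc: errors.append(str(exc)))
    print(reply, errors)
```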
Teletraan
- Rollback, hotfix, rolling deploy, staging and testing, visibility and usability
- client-server model
- pre/post download, restart, etc scripts included with every deployment
- pause/resume various testing
Postmortems and production readiness reviews
Cloud is not infinite; you will often hit AWS capacity limits, or even find nothing available in the region
Need to be able to know what you are running and whether it is efficiently used
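One way to picture the per-deployment scripts is as an ordered list of lifecycle hooks, where the deploy stops at the first hook that fails. The sketch below only illustrates that idea; the stage names and the run_deploy function are hypothetical and are not Teletraan's actual interface.

```python
# Hypothetical stage names, in the order a deploy agent might run them.
STAGES = ["pre_download", "post_download", "pre_restart", "restart", "post_restart"]


def run_deploy(hooks, stages=STAGES):
    """Run each stage's hook in order.

    hooks maps a stage name to a callable returning True on success.
    Returns (completed_stages, failed_stage); failed_stage is None when
    every hook that was provided succeeded.
    """
    completed = []
    for stage in stages:
        hook = hooks.get(stage)
        if hook is None:
            continue  # a service only ships the scripts it needs
        if not hook():
            return completed, stage
        completed.append(stage)
    return completed, None


if __name__ == "__main__":
    hooks = {
        "pre_download": lambda: True,
        "restart": lambda: True,
        "post_restart": lambda: False,  # e.g. a failed health check
    }
    print(run_deploy(hooks))
```

A failed stage is the natural point for the deploy system to pause or roll back.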
Open sourced tools
- mysql_utils – lots of tools to manage many DBs
- Thrift tools
- Teletraan – open sourced in Feb 2016
- github.com/pinterest