Linux.conf.au 2016 – Sysadmin Miniconf – Session 2

Site Reliability Engineering at Dropbox – Tammy Butow

  • Having an SLA and measuring against it. Caps on ops work, blameless post-mortems, always be coding
  • 400M customers, a billion files every day
  • Very hard to find people to scale, so build tools to scale instead
  • Team looks after 6,000 DB machines, the whole machine not just the app
  • Build a lot of tools in Python and Go
  • PygerDuty – Python library for the PagerDuty API
    • Easy to find the top things paging, write tools to reduce these
    • Regular weekly meeting to review those problems and make them better
    • If work is happening on machines then turn off monitoring on them so people don’t get woken up for things they don’t need to.
    • Going for days without any pages
  • Self-Healing and auto-remediation scripts
  • Hermes
    • Allocate and track tasks to groups
  • Automation of DB tasks
  • Bot copies PagerDuty alerts into Slack
  • Aim Higher
    • Create a roadmap for next 12 months
    • Building a rocketship while it is flying through the sky
  • Follow the Sun so people are working days
  • Post Mortem for every page
  • Frequent DR testing
  • Take time out to celebrate
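
As a rough illustration of the "find the top things paging" idea above, here is a minimal Python sketch. It does not call the PagerDuty API or PygerDuty; the incident records and their field names ("service", "summary") are made up for the example, so treat it as one possible shape for such a tool rather than what Dropbox runs.

```python
from collections import Counter


def top_pagers(incidents, n=3):
    """Return the n most frequent (service, summary) pairs among incidents."""
    counts = Counter((i["service"], i["summary"]) for i in incidents)
    return counts.most_common(n)


# Hypothetical sample data, not real PagerDuty output.
incidents = [
    {"service": "mysql", "summary": "replication lag"},
    {"service": "mysql", "summary": "disk full"},
    {"service": "mysql", "summary": "replication lag"},
    {"service": "web", "summary": "high latency"},
    {"service": "mysql", "summary": "replication lag"},
]

for (service, summary), count in top_pagers(incidents):
    print(f"{count:3d}  {service}: {summary}")
```

A list like this is the sort of thing that could feed the weekly review meeting: the top entry is the first candidate for a fix or an auto-remediation script.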

I missed out writing up the next couple of talks due to technical problems

 


Linux.conf.au 2016 – Sysadmin Miniconf – Session 1

Is that a Cloud in your pocket – Steven Ellis

  • What if you could have a demo of a stack on a phone
  • or on a memory stick or a mini raspberry-pi type PC
  • Nested Virtualisation
  • Hardware
    • Using Linux as host env, not so good on Win and Mac
    • ThinkPad, Fedora or CentOS, 128GB SSD
  • Nested Virtualisation
    • Huge performance boost over qemu
    • Use SSD
    • enable options in modules kvm-intel or kvm-amd
    • Confirm SSD perf 1st – hdparm -t /dev/sdX
    • Create base env for VMs, enable vmx in features
    • Make sure it uses a different network so doesn’t badly interact with ones further out
  • Think LVM
    • Create thin pool for all envs
    • Thin on LVM: set “issue_discards = 1”
  • Base image
    • Doesn’t have to be minimal
    • update the base regularly
    • How do you build your base image?
      • Thin may go weirdly wrong
      • Always use kickstart to re-create it.
    • Think of your use case, don’t skimp on the disk (eg 40G disk for image)
    • ssh keys, Enable yum cache
    • Patch once kicked
    • keep a content cache, maybe with rsync or mrepo
  • Turn off VM and then use fstrim and partx to make it nice and small.
  • virt-manager can’t manage thin volumes, DON’T manually add the path there
  • use virsh to add the path manually instead.
  • snapshots of snapshots, great performance on SSD
  • Thin no longer activates automatically on distros
  • packstack: simple way to install a simple OpenStack setup
  • LVM vs QCOW
    • qcow okay for smaller images
    • cloud-init with atomic
    • do not snapshot a qcow image when it is running
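
The "enable options in modules kvm-intel or kvm-amd" step can be checked from userspace. Below is a small Python sketch, assuming a Linux host where the kvm module exposes its "nested" parameter under /sys/module; the function names are mine, not from the talk.

```python
from pathlib import Path


def nested_enabled(value):
    """Interpret the contents of a kvm 'nested' module parameter file."""
    return value.strip() in ("Y", "y", "1")


def check_nested(sys_root="/sys/module"):
    """Return {module: bool} for whichever of kvm_intel / kvm_amd is loaded."""
    result = {}
    for module in ("kvm_intel", "kvm_amd"):
        param = Path(sys_root) / module / "parameters" / "nested"
        if param.exists():
            result[module] = nested_enabled(param.read_text())
    return result


if __name__ == "__main__":
    status = check_nested()
    if not status:
        print("No kvm_intel or kvm_amd module loaded")
    for module, enabled in status.items():
        print(f"{module}: nested virtualisation {'on' if enabled else 'off'}")
```

If this reports "off", the usual fix is to set the module's nested option (for example via a file in /etc/modprobe.d) and reload the module.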

Revisiting Unix principles for modern system automation – Martin Krafft

  • SSH Botnet
  • OSI of System Automation
  • Transport unix style, both push and pull
  • uses socat for low level data moving
  • autossh <- restarts ssh connection automatically
  • creates control socket

A Gentle Introduction to Ceph – Tim Serong

  • Ceph gives a storage cluster that is self healing and self managed
  • 3 interfaces, object, block, distributed fs
  • OSD with files on them, monitor nodes
  • OSD will forward writes to other replicas of the data
  • clients can read from any OSD
  • Software defined storage vs legacy appliances
  • Network
    • Fastest you can, separate public and cluster networks
    • cluster faster than public
  • Nodes
    • 1-2GB RAM per TB of storage
    • read recommendations
  • SSD journals to cache writes
  • Redundancy
    • Replications – capacity impact but usually good performance
    • Erasure coding – Like raid – better space efficiency but impact in most other areas
  • Adding more nodes
    • tends to work
    • temp impact during rebalancing
  • How to size
    • understand your workload
    • make a guess
    • Build a 10% pilot
    • refine until perf is achieved
    • scale up the pilot
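
To make the replication vs erasure coding trade-off and the RAM rule of thumb concrete, here is some back-of-the-envelope arithmetic in Python. The three copies and the 4+2 erasure coding profile are my own example values, not figures from the talk, and the sketch ignores filesystem overhead and the free space a cluster needs to rebalance.

```python
def usable_replicated(raw_tb, copies=3):
    """Usable capacity when every object is stored `copies` times."""
    return raw_tb / copies


def usable_erasure_coded(raw_tb, k=4, m=2):
    """Usable capacity with k data chunks plus m coding chunks per object."""
    return raw_tb * k / (k + m)


def ram_range_gb(node_storage_tb):
    """RAM for one node using the 1-2GB per TB rule of thumb."""
    return (node_storage_tb * 1, node_storage_tb * 2)


raw = 120  # TB of raw disk across the cluster (example value)
print("replicated x3:", usable_replicated(raw), "TB usable")
print("erasure 4+2:  ", usable_erasure_coded(raw), "TB usable")
print("RAM for a 48TB node:", ram_range_gb(48), "GB")
```

With these example numbers, 120TB raw gives 40TB usable with three copies and 80TB with 4+2 erasure coding, which is the "better space efficiency" side of the trade-off.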

Keeping Pinterest running – Joe Gordon

  • Software vs service
    • No stable versions
    • Only one version is live
    • Devs support their own service – aligns incentives, eg monitoring built in
    • Testing against production traffic
  • SRE at Pinterest
    • Like a pit crew in F1
    • firefighting at scale
    • changing tires while moving
  • Operation Maturity
  • Operation Excellence
    • Have the best practices, docs, process, improvements
    • Repeatable deploys
  • Visibility
    • data driven company
    • Lots of Time series data – TSDB
    • Using ELK
  • Deployments
    • no impact to end user
    • easy to do, every few minutes
  • Canary vs Staging
    • Send dark traffic (copies of real traffic) to canary box without sending anything back to user
    • Bounce back to starting if problems
  • Teletraan
    • Rollback, hotfix, rolling deploy, starting and testing, visibility and usability
    • client-server model
    • pre/post download, restart, etc scripts included with every deployment
    • pause/resume various testing
  • Postmortems and production readiness reviews
  • Cloud is not infinite, often will hit AWS capacity limits or even no available stuff in the region
  • Need to be able to make sure you know what you are running and if it is efficiently used
  • Open sourced tools
    • mysql_utils – lots of tools to manage many DBs
    • Thrift tools
    • Teletraan – open sourced in Feb 2016
    • github.com/pinterest
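
The "dark traffic to a canary" idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Pinterest or Teletraan implement it; the handler functions and the request format are hypothetical.

```python
def handle_with_dark_copy(request, production, canary, log):
    """Serve from production; replay a copy on the canary and discard its reply."""
    response = production(request)
    try:
        canary_response = canary(dict(request))  # canary gets its own copy
        if canary_response != response:
            log.append(("mismatch", request, canary_response))
    except Exception as exc:  # a broken canary must never affect the user
        log.append(("error", request, repr(exc)))
    return response


# Hypothetical handlers standing in for the live and canary builds.
def production(req):
    return {"status": 200, "body": req["path"].upper()}


def canary(req):
    if req["path"] == "/boom":
        raise RuntimeError("canary bug")
    return {"status": 200, "body": req["path"].upper()}


log = []
print(handle_with_dark_copy({"path": "/pins"}, production, canary, log))
print(handle_with_dark_copy({"path": "/boom"}, production, canary, log))
print(log)
```

The user only ever sees the production response; canary failures end up in the log. A real setup would most likely send the copy asynchronously so the canary cannot add latency.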

Linux.conf.au 2016 – Tuesday – Keynote: George Fong

George Fong – Chair of Internet Australia

The Challenges of the Changing Social Significance of the Nerd

  • “This is the first conference I’ve been to where there’s an extremely high per capita number of ponytails”
  • Linux is not just running web servers and other servers but also network devices
  • Linux and the Web aren’t the same thing, but they’ve grown symbiotically and neither would be the same without the other
  • “One of the lessons we’ve learned in Australia is that when you mix technology with politics, you get into trouble”
  • “We have proof in Australia that if you take guns away from people, people stop getting killed”

 


Linux.conf.au 2016 – Monday – Session 3

Cloud Anti-Patterns – Casey West

  • The 5 stages of Cloud Native
  • Deploying my apps to the cloud is painful – why?
  • Denial
    • “Containers are like tiny VMs”
    • Anti-Pattern 1 – do not assume what you have now is what you want to put into the cloud or a container
    • “We don’t need to automate continuous delivery”
    • We shouldn’t automate what we have until it is perfect. Automate to make things consistent (not always perfect at least at the start)
  • Anger
    • “works on my machine”
    • Devs just push straight from dev boxes to production
    • Not about making worse code go to production faster
    • Aim for repeatable, testable builds, just faster
  • Bargaining
    • “We crammed the monolith into a container and called it a microservice”
    • Anti-Pattern: Critically think on what you need to re-factor (or “re-platforming” )
    • “Bi-modal IT”
    • Some stuff on fast lane, some stuff on old-way slow lane
    • Anti-pattern: legacy products put into slow lane, these are often the ones that really need to be fixed.
    • “Microservices” talking to same data source, not APIs
  • Depression
    • “200 microservices but forgot to setup Jenkins”
    • “We have an automated build pipeline but only release twice per year”
  • Acceptance
    • All software sucks, even the stuff we write
    • Respect CAP theorem
    • Respect Conway’s Law
    • Small batch sizes work for replatforming too
  • Microservices architecture, Devops culture, Continuous delivery – Pick all three

Cloud Crafting – Public / Private / Hybrid  – Steven Ellis

  • What does Hybrid mean to you?
  • What is private Cloud (IAAS)
  • Hybrid – communicate to public cloud and manage local stuff
  • ManageIQ – single pane of glass for hardware, VMs, clouds, containers
  • What does it do?
    • Brownfields as well as Greenfields, gathers current setup
    • Discovery, API presentations, control and detect when env non-compliant (eg not fully patched)
    • On-premise or public cloud
    • Supplied as a virtual appliance, HA, scale out
    • Platform – CentOS 7, Rails, PostgreSQL, GUI, some dashboards out of the box.
  • Get involved
    • Online, roadmap is public
    • Various contributors
  • DEMO
  • Just put in credentials to allow access and then it can gather the data straight away

Live Migration of Linux Containers by Tycho Andersen

  • LXC / LXD
  • LXD is a REST API that you use to control the container system
  • tool -> REST -> Daemon -> lxc -> Kernel
  • “lxc move host1:c1 host2: ” – Live migrations
    • Needs a bit of work since lots moving, lots of ways it could fail
    • 3 channels created, control, filesystem, container processes state
  • CRIU
    • 5 years of check-pointing
    • Lots based off OpenVZ initial work
    • All sorts of things need to support check-pointing and moving (eg selinux)
    • Iterative migration added
    • Lots of hooks needed for very privileged kernel features
  • Filesystems
    • btrfs, lvm, zfs, (swift, nfs), have special support for migration that it hooks into
    • rsync between incompatible hosts
  • Memory State
    • Stop the world and move it all
    • Iterative incremental transfer (via p.haul) being worked on.
  • LXC + LXD 2.0 should be in Ubuntu 16.04 LTS
  • Need to use latest versions and latest kernels for best results.
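
The difference between "stop the world" and iterative transfer of memory state can be shown with a toy simulation in Python. This is not how CRIU or p.haul work internally; the page counts, the fixed percentage of pages dirtied per round, and the threshold are all invented for illustration.

```python
def iterative_migrate(total_pages, dirty_percent, threshold, max_rounds=10):
    """Simulate pre-copy: resend dirtied pages until few enough remain, then freeze.

    Returns (rounds_while_running, pages_sent_while_running, pages_sent_frozen).
    """
    to_send = total_pages  # the first round sends everything
    sent_live = 0
    rounds = 0
    while to_send > threshold and rounds < max_rounds:
        sent_live += to_send
        rounds += 1
        # While we were copying, the still-running container dirtied some pages.
        to_send = to_send * dirty_percent // 100
    # Final step: stop the container and send whatever is still dirty.
    return rounds, sent_live, to_send


# Stop the world: threshold is so high that nothing is sent while running.
print(iterative_migrate(10000, dirty_percent=10, threshold=10000))
# Iterative: keep copying while running until at most 50 pages are left.
print(iterative_migrate(10000, dirty_percent=10, threshold=50))
```

In this toy model the iterative run sends more pages in total but only 10 of them while the container is frozen, compared to all 10,000 for stop the world. That shorter freeze is the point of the iterative approach.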

Linux.conf.au 2016 – Monday – Session 1

Open Cloud Miniconf – Continuous Delivery using blue-green deployments and immutable infrastructure by Ruben Rubio Rey

  • Lots of things can go wrong in a deployment
  • Often hard to do rollbacks once upgrade happens
  • Blue-Green deployment is running several envs at the same time, each potentially with different versions
  • Immutable infrastructure: split between data (which changes) and everything else, which only gets replaced fully by deployments, not changed
  • When you use docker don’t store data in the container, makes it immutable. But containers are not required to do this.
  • Rule 1 – Never modify the infrastructure
  • Rule 2 – Instead of modifying – always create from ground up everything that is not data.
  • Advantages
    • Rollbacks easy
    • Avoid Configuration drift
    • Updated and accurate infrastructure documentation
  • Split things up
    • No State – LBs, Web servers, App Servers
    • Temp data, Volatile State – message queues, email servers
    • Persistent data – Databases, Filesystems, slow warming cache
  • In case of temp data you have to be able to drain
  • Use LBs and multiple servers to split up infrastructure; more bits give more room to split up the upgrades.
  • If pending jobs require old/new version of app then route to servers that have/not been upgraded yet.
  • Put toy rocket launcher in devs office, shoots person who broke the build.
  • Need to “use activity script” to bleed traffic off section of the “temp data” layer of infrastructure, determine when it is empty and then re-create.
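
The two rules (never modify, always re-create everything that is not data) and the easy rollback can be sketched as a toy Python model. The class, the environment names and the health check are all hypothetical; a real setup would be driving a load balancer and provisioning tooling rather than a dictionary.

```python
class BlueGreenRouter:
    """Toy load balancer that points at one of two environments."""

    def __init__(self, environments, live):
        self.environments = environments  # name -> deployed version
        self.live = live

    @property
    def idle(self):
        """Name of the environment that is not taking traffic."""
        return next(name for name in self.environments if name != self.live)

    def deploy(self, version, healthy):
        """Build the idle environment from scratch, then switch if it is healthy."""
        target = self.idle
        self.environments[target] = version  # replaced, never modified in place
        if not healthy(version):
            return False  # live traffic never moved
        self.live = target
        return True

    def rollback(self):
        """The previous environment is still intact, so just switch back."""
        self.live = self.idle


router = BlueGreenRouter({"blue": "v1", "green": None}, live="blue")
router.deploy("v2", healthy=lambda version: True)
print(router.live, router.environments)
router.rollback()
print(router.live, router.environments)
```

Because the old environment is left untouched until the next deploy, rollback is only a pointer switch. The sketch covers the stateless layer only; temp data layers would also need draining before being re-created.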