Linux.conf.au 2016 – Sysadmin Miniconf – Session 3

The life of a Sysadmin in a research environment – Eric Burgueno

  • Everything must be reproducible
  • Keep systems up as long as possible, rather than aiming for an overall uptime percentage
  • One person needs to cover lots of roles rather than specialise
  • 2 Servers with 2TB of RAM. Others smaller according to need
  • Lots of varied tools, mostly bioinformatics software
  • 90TB to over 200TB of data over 2 years. Lots of large files. Big files, big servers.
  • Big job using 2TB of RAM taking 8 days to run.
  • The 2*2TB servers can be joined together to create a single 4TB server
  • Have to customize the environment for each tool, which is hard when there are lots of tools and you also want to compare/collaborate with other places where the software is being run.
  • Reproducible(?) Research

Creating bespoke logging systems and dashboards with Grafana, in fifteen minutes – Andrew McDonnell

Live Demo

Order in the chaos: or lessons learnt on planning in operations – Peter Hall

  • Lead of an Ops team at REA group. Looks after dev teams for 10-15 applications
  • Ops is not a project, but works with many projects
  • Many sources of work, dev, security, incidents, infrastructure improvement
  • Understand the work
    • Document your work
    • Talk about it, 15min standup
  • Schedule things
    • and prepare for the unplanned
    • Perhaps 2 weeks
    • Leave lots of slack
  • Interruptions
    • Assign team members to each ops teams
    • Rotating “ops goal keeper”
    • Developers on pager
  • Review Often
  • Longer term goals for your team
  • Failure demand vs value demand.
    • Make sure [at least some of] what you are doing is adding value to the environment

 

From Commit to Cloud – Daniel Hall

  • Deployments should be:
    • fast – 10 minutes
    • small – only one feature change, and the person deploying should be aware of everything that is changing
    • easy – as little human work as possible, simple to understand
  • We believe this because
    • less to break
    • devs should focus on dev
    • each project should be really easy to learn, devs can switch between projects easily
    • Don’t want anyone to be afraid to deploy
  • Able to rollback
    • 30 microservices
    • 2 devs plus some work from others
  • How to do it
    • Microservices arch (optional but helps)
    • git , build agent, packaging format with dependencies
    • something to run your stuff
  • code -> git -> build -> auto test -> package -> staging -> test -> deploy to prod
  • Application build is triggered by git
    • build.sh script in each repo
  • Auto test after build, don’t do end-to-end testing, do that in staging
  • Package app – they use docker – push to internal docker repo
  • Deploy to staging – they use curl to push JSON to Mesos/Marathon, which pulls the container. Testing runs there
  • Single Click approval to deploy to staging
  • Deploy to prod – should be same as how you deploy to staging.
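
The staging deploy step above could look roughly like the following Python sketch. It only builds the JSON body that would be POSTed to Marathon's `/v2/apps` endpoint and prints it; the app id, registry host, image tag and resource numbers are hypothetical and not from the talk, and nothing is sent over the network.

```python
import json


def marathon_app(app_id, image, instances=2, cpus=0.25, mem=256):
    """Build a minimal Marathon app definition for a Docker container."""
    return {
        "id": app_id,
        "instances": instances,
        "cpus": cpus,
        "mem": mem,  # megabytes
        "container": {
            "type": "DOCKER",
            "docker": {"image": image},
        },
    }


# Hypothetical service and internal registry names, for illustration only.
payload = json.dumps(
    marathon_app(
        "/staging/example-service",
        "registry.internal.example/example-service:abc123",
    ),
    indent=2,
)
print(payload)
```

Using the same function for staging and prod, with only the app id changing, is one way to keep the two deploys identical as the last bullet suggests.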

LNAV – Paul Wayper

  • Point at a dir. read all the files. sort all the lines together in timestamp order
  • Colour codes machines and different facilities (daemons). Highlights IP addresses
  • Error lines in red, warning lines in yellow
  • Regular expressions highlighted. Fully PCRE compatible
  • Able to move back and forth an hour or a day at a time with special keys
  • Histogram of error lines, number per minute etc
  • more complete (SQL like) queries
  • compiles as a static binary
  • Ability to add your own log file formats
  • Ability to share format filters with others
  • Doesn’t deal with journald logs
  • Available for EPEL, Fedora, Debian but under a lot of active development.
  • acts like tail -f to spot updates to logs.
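
The core idea in the first bullet, interleaving lines from several files in timestamp order, can be sketched in Python as below. This is not how lnav itself is implemented; it assumes every line starts with an ISO-8601 timestamp and that each input is already sorted, and the sample log lines are made up.

```python
import heapq
from datetime import datetime


def parse_ts(line):
    """Parse a leading timestamp such as 2016-02-01T10:00:00."""
    return datetime.strptime(line.split(" ", 1)[0], "%Y-%m-%dT%H:%M:%S")


def merge_logs(*logs):
    """Merge several already-sorted iterables of log lines into one list."""
    return list(heapq.merge(*logs, key=parse_ts))


# Made-up sample lines standing in for two separate log files.
web = ["2016-02-01T10:00:00 web GET /", "2016-02-01T10:00:05 web GET /about"]
db = ["2016-02-01T10:00:02 db slow query", "2016-02-01T10:00:09 db error: disk full"]

for line in merge_logs(web, db):
    print(line)
```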

Linux.conf.au 2016 – Sysadmin Miniconf – Session 2

Site Reliability Engineering at Dropbox – Tammy Butow

  • Having an SLA and measuring against it. Cap ops work, blameless post-mortems, always be coding
  • 400M customers, a billion files every day
  • Very hard to find people to scale, so build tools to scale instead
  • Team looks at 6,000 DB machines, look after whole machines not just the app
  • Build a lot of tools in python and go
  • PygerDuty – python library for pagerduty API
    • Easy to find the top things paging, write tools to reduce these
    • Regular weekly meeting to review those problems and make them better
    • If work is happening on machines then turn off monitoring on them so people don’t get woken up for things they don’t need to.
    • Going for days without any pages
  • Self-Healing and auto-remediation scripts
  • Hermes
    • Allocate and track tasks to groups
  • Automation of DB tasks
  • Bot copies PagerDuty alerts into Slack
  • Aim Higher
    • Create a roadmap for next 12 months
    • Building a rocketship while it is flying through the sky
  • Follow the Sun so people are working days
  • Post Mortem for every page
  • Frequent DR testing
  • Take time out to celebrate
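
Finding "the top things paging" is essentially a frequency count over past pages. A minimal Python sketch, assuming the pages have already been exported as simple records (the field names and sample data are hypothetical, and this does not use the PygerDuty API):

```python
from collections import Counter


def top_pagers(pages, n=3):
    """Return the n most frequent alert names from a list of page records."""
    return Counter(page["alert"] for page in pages).most_common(n)


# Hypothetical exported page records, for illustration only.
pages = [
    {"alert": "disk_full", "host": "db-001"},
    {"alert": "replication_lag", "host": "db-002"},
    {"alert": "disk_full", "host": "db-003"},
    {"alert": "disk_full", "host": "db-001"},
    {"alert": "host_down", "host": "db-004"},
    {"alert": "replication_lag", "host": "db-005"},
]

for alert, count in top_pagers(pages):
    print(alert, count)
```

The output of something like this would be the input to the weekly review meeting: fix or automate whatever sits at the top of the list.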

I missed out writing up the next couple of talks due to technical problems

 


Linux.conf.au 2016 – Sysadmin Miniconf – Session 1

Is that a Cloud in your pocket – Steven Ellis

  • What if you could have a demo of a stack on a phone
  • or on a memory stick or a mini raspberry-pi type PC
  • Nested Virtualisation
  • Hardware
    • Using Linux as host env, not so good on Win and Mac
    • Thinkpad, fedora or Centos, 128GB SSD
  • Nested Virtualisation
    • Huge performance boost over qemu
    • Use SSD
    • enable options in modules kvm-intel or kvm-amd
    • Confirm SSD perf 1st – hdparm -t /dev/sdX
    • Create base env for VMs, enable vmx in features
    • Make sure it uses a different network so doesn’t badly interact with ones further out
  • Think LVM
    • Create thin pool for all envs
    • Think about the LVM setting “issue_discards = 1”
  • Base image
    • Doesn’t have to be minimal
    • update the base regularly
    • How do you build your base image?
      • Thin may go weirdly wrong
      • Always use kickstart to re-create it.
    • Think of your use case, don’t skimp on the disk (eg 40G disk for image)
    • ssh keys, Enable yum cache
    • Patch once kicked
    • keep a content cache, maybe with rsync or mrepo
  • Turn off VM and then use fstrim and partx to make it nice and small.
  • virt-manager can’t manage thin volumes, DONT manually add the path
  • use virsh to manually add the path.
  • snapshots of snapshots: great performance on SSD
  • Thin no longer activates automatically on distros
  • packstack simple way to install simple openstack setup
  • LVM vs QCOW
    • qcow okay for smaller images
    • cloud-init with atomic
    • do not snapshot a qcow image when it is running

Revisiting Unix principles for modern system automation – Martin Krafft

  • SSH Botnet
  • OSI of System Automation
  • Transport unix style, both push and pull
  • uses socat for low level data moving
  • autossh <- restarts ssh connection automatically
  • creates control socket

A Gentle Introduction to Ceph – Tim Serong

  • Ceph gives a storage cluster that is self healing and self managed
  • 3 interfaces, object, block, distributed fs
  • OSD with files on them, monitor nodes
  • OSD will forward writes to other replicas of the data
  • clients can read from any OSD
  • Software defined storage vs legacy appliances
  • Network
    • Fastest you can, separate public and cluster networks
    • cluster faster than public
  • Nodes
    • 1-2G ram per TB of storage
    • read recommendations
  • SSD journals to cache writes
  • Redundancy
    • Replications – capacity impact but usually good performance
    • Erasure coding – Like raid – better space efficiency but impact in most other areas
  • Adding more nodes
    • tends to work
    • temp impact during rebalancing
  • How to size
    • understand your workload
    • make a guess
    • Build a 10% pilot
    • refine until perf is achieved
    • scale up the pilot
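
For the "make a guess" step, the capacity and RAM rules of thumb above reduce to simple arithmetic. The Python sketch below is a rough estimate only; the 3x replica count and the k=4, m=2 erasure coding profile are assumptions chosen for illustration, not figures from the talk.

```python
def usable_replicated(raw_tb, replicas=3):
    """Usable capacity when every object is stored `replicas` times."""
    return raw_tb / replicas


def usable_erasure(raw_tb, k=4, m=2):
    """Usable capacity with k data chunks and m coding chunks per object."""
    return raw_tb * k / (k + m)


def ram_range_gb(raw_tb):
    """RAM estimate from the 1-2G per TB of storage rule of thumb."""
    return (raw_tb * 1, raw_tb * 2)


raw = 120  # hypothetical raw cluster size in TB
print("replicated usable TB:", usable_replicated(raw))
print("erasure coded usable TB:", usable_erasure(raw))
print("RAM range in GB:", ram_range_gb(raw))
```

This shows the space trade-off between the two redundancy options; it says nothing about the performance impact of erasure coding, which is what the pilot is for.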

Keeping Pinterest running – Joe Gordon

  • Software vs service
    • No stable versions
    • Only one version is live
    • Devs support their own service – aligns incentives, eg monitoring built in
    • Testing against production traffic
  • SRE at Pinterest
    • Like a pit crew in F1
    • firefighting at scale
    • changing tires while moving
  • Operation Maturity
  • Operation Excellence
    • Have the best practices, docs, process, improvements
    • Repeatable deploys
  • Visibility
    • data driven company
    • Lots of Time series data – TSDB
    • Using ELK
  • Deployments
    • no impact to end user
    • easy to do, every few minutes
  • Canary vs Staging
    • Send dark (copies) of traffic to canary box without sending anything back to user
    • Bounce back to starting if problems
  • Teletraan
    • Rollback, hotfix, rolling deploy, starting and testing, visibility and usability
    • client-server model
    • pre/post download, restart, etc scripts included with every deployment
    • pause/resume various testing
  • Postmortems and production readiness reviews
  • Cloud is not infinite, often will hit AWS capacity limits or even find nothing available in the region
  • Need to be able to make sure you know what you are running and if it is efficiently used
  • Open sourced tools
    • mysql_utils – lots of tools to manage many DBs
    • Thrift tools
    • Teletraan – open sourced in Feb 2016
    • github.com/pinterest

Linux.conf.au 2016 – Tuesday – Keynote: George Fong

George Fong – Chair of Internet Australia

The Challenges of the Changing Social Significance of the Nerd

  • “This is the first conference I’ve been to where there’s an extremely high per capita number of ponytails”
  • Linux is not just running web servers and other servers but also network devices
  • Linux and the Web aren’t the same thing, but they’ve grown symbiotically and neither would be the same without the other
  • “One of the lessons we’ve learned in Australia is that when you mix technology with politics, you get into trouble”
  • “We have proof in Australia that if you take guns away from people, people stop getting killed”

 


Linux.conf.au 2016 – Monday – Session 3

Cloud Anti-Patterns – Casey West

  • The 5 stages of Cloud Native
  • Deploying my apps to the cloud is painful – why?
  • Denial
    • “Containers are like tiny VMs”
    • Anti-Pattern 1 – do not assume what you have now is what you want to put into the cloud or a container
    • “We don’t need to automate continuous delivery”
    • We shouldn’t automate what we have until it is perfect. Automate to make things consistent (not always perfect at least at the start)
  • Anger
    • “works on my machine”
    • Devs just push straight from dev boxes to production
    • Not about making worse code go to production faster
    • Aim for repeatable, testable builds, just faster
  • Bargaining
    • “We crammed the monolith into a container and called it a microservice”
    • Anti-Pattern: Critically think on what you need to re-factor (or “re-platforming” )
    • “Bi-modal IT”
    • Some stuff on fast lane, some stuff on old-way slow lane
    • Anti-pattern: legacy products put into slow lane, these are often the ones that really need to be fixed.
    • “Microservices” talking to same data-source, not APIs
  • Depression
    • “200 microservices but forgot to setup Jenkins”
    • “We have an automated build pipeline but only release twice per year”
  • Acceptance
    • All software sucks, even the stuff we write
    • Respect CAP theorem
    • Respect Conway’s Law
    • Small batch sizes work for replatforming too
  • Microservices architecture, Devops culture, Continuous delivery – Pick all three

Cloud Crafting – Public / Private / Hybrid  – Steven Ellis

  • What does Hybrid mean to you?
  • What is private Cloud (IAAS)
  • Hybrid – communicate to public cloud and manage local stuff
  • ManageIQ – single pane of glass for hardware, VMs, clouds, containers
  • What does it do?
    • Brownfields as well as Greenfields, gathers current setup
    • Discovery, API presentations, control and detect when env non-compliant (eg not fully patched)
    • Premise or public cloud
    • Supplied as a virtual appliance, HA, scale out
    • Platform – Centos 7, Rails, Postgres, GUI, some dashboards out of the box.
  • Get involved
    • Online, roadmap is public
    • Various contributors
  • DEMO
  • Just put in credentials to allow access and then it can gather the data straight away

Live Migration of Linux Containers by Tycho Andersen

  • LXC / LXD
  • LXD is a REST API that you use to control the container system
  • tool -> REST -> Daemon -> lxc -> Kernel
  • “lxc move host1:c1 host2: ” – Live migrations
    • Needs a bit of work since lots moving, lots of ways it could fail
    • 3 channels created, control, filesystem, container processes state
  • CRIU
    • 5 years of check-pointing
    • Lots based off open-VZ initial work
    • All sorts of things need to support check-pointing and moving (eg selinux)
    • Iterative migration added
    • Lots of hooks needed for very privileged kernel features
  • Filesystems
    • btrfs, lvm, zfs, (swift, nfs), have special support for migration that it hooks into
    • rsync between incompatible hosts
  • Memory State
    • Stop the world and move it all
    • Iterative incremental transfer (via p.haul) being worked on.
  • LXC + LXD 2.0 should be in Ubuntu 16.04 LTS
  • Need to use latest versions and latest kernels for best results.

Linux.conf.au 2016 – Monday – Session 1

Open Cloud Miniconf – Continuous Delivery using blue-green deployments and immutable infrastructure by Ruben Rubio Rey

  • Lots of things can go wrong in a deployment
  • Often hard to do rollbacks once upgrade happens
  • Blue-Green deployment is running several envs at the same time, each potentially with different versions
  • Immutable infrastructure , split between data (which changes) and everything else only gets replaced fully by deployments, not changed
  • When you use docker don’t store data in the container, makes it immutable. But containers are not required to do this.
  • Rule 1 – Never modify the infrastructure
  • Rule 2 – Instead of modifying – always create from ground up everything that is not data.
  • Advantages
    • Rollbacks easy
    • Avoid Configuration drift
    • Updated and accurate infrastructure documentation
  • Split things up
    • No State – LBs, Web servers, App Servers
    • Temp data , Volatile State – message queues, email servers
    • Persistent data – Databases, Filesystems, slow warming cache
  • In case of temp data you have to be able to drain
  • Use LBs and multiple servers to split up infrastructure; more bits give more room to split up the upgrades.
  • If pending jobs require old/new version of app then route to servers that have/not been upgraded yet.
  • Put toy rocket launcher in devs office, shoots person who broke the build.
  • Need to “use activity script” to bleed traffic off section of the “temp data” layer of infrastructure, determine when it is empty and then re-create.
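
The drain-then-recreate step for the "temp data" layer can be sketched in Python as follows. The three callables are hypothetical stand-ins for whatever stops new traffic, reports queue depth, and rebuilds the node; the simulated queue exists only so the example runs on its own.

```python
import time


def drain_and_recreate(queue_depth, stop_intake, recreate,
                       poll_seconds=0.0, max_polls=100):
    """Stop new work, wait until the queue is empty, then rebuild the node.

    Returns True if the node was recreated, False if it never drained.
    """
    stop_intake()
    for _ in range(max_polls):
        if queue_depth() == 0:
            recreate()
            return True
        time.sleep(poll_seconds)  # zero by default so the demo runs instantly
    return False


# Simulated message queue that empties by one job per poll.
jobs = [3]
events = []


def depth():
    current = jobs[0]
    jobs[0] = max(0, current - 1)
    return current


recreated = drain_and_recreate(
    depth,
    lambda: events.append("intake stopped"),
    lambda: events.append("recreated"),
)
print(recreated, events)
```

Returning False instead of recreating anyway keeps the sketch on the safe side: a node that never drains is left alone for a human to look at.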

Priorities for 2016

This is almost a New Year’s resolutions page, but not quite. It is a list of the stuff that will take priority over other things in 2016.

  • Chess – Aim to play regularly in tournaments, do weekly coaching and study at least 7 hours per week on tactics, endgames and openings.
  • Programming – Continue improving my programming skills, finish the book I am on, do a few exercises and create a few things
  • Blogging – At least 1 post each month to both my personal blog and the Auckland Chess Centre website
  • Driving – Get my Restricted Driver License
  • Reading – Read books (not online) at least half an hour per day
  • Health – 7500 steps every weekday plus get to goal weight
  • Conference – Run successful Sysadmin Miniconf at Linux.conf.au 2016

Stretch Goals – If I am keeping up with the above

  • Start working my way through Shakespeare’s plays
  • Do a couple of new website projects I’ve been putting off
  • Watch 2-3 hours of TV each week.

Studying for Driver license test with Anki

In 2014 I decided to do a bit of work to finally get my New Zealand driver license. The first step towards this was passing the theory test, which is a 35 question test given on computer. You have to get at least 32 questions right to pass.

After spending a bit of time looking at the roadcode book I decided to go with just learning the questions. I did this by:

  1. Buying some of the official practice exams
  2. Grabbing other questions from unofficial sites
  3. Entering some other questions manually from the books

I took all these questions and created an Anki deck. Anki is spaced repetition software that I use to learn things. I tell it to ask me a few new questions every day; if I get them wrong it asks me again tomorrow, if I get them right it asks me again next week. Gradually, as I learn something, it asks me less often (see the more technical explanation here).

A typical question on an Anki deck looks like these screenshots:

[Screenshots: Screenshot_2015-12-10-21-05-24, Screenshot_2015-12-10-21-04-24]

The screenshot on the left shows me being asked the question. Once I pick my answer I look at the actual answer (see rightmost screenshot).

If I get it wrong I get the card again in 10 minutes. If I get it right then, depending on how easy I judged it, I might not see it again for months.
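
The scheduling idea can be sketched in a few lines of Python. This is a deliberately simplified model and not Anki's actual algorithm: the 10 minute retry comes from the description above, while the one day first interval and the 2.5 multiplier are assumptions made for illustration.

```python
from datetime import timedelta


def next_interval(previous, correct, ease=2.5):
    """Return how long to wait before showing a card again."""
    if not correct:
        return timedelta(minutes=10)  # wrong answer: see it again shortly
    if previous < timedelta(days=1):
        return timedelta(days=1)  # first correct answer: wait a day
    return previous * ease  # each later correct answer stretches the gap


# A card answered correctly several times in a row.
interval = timedelta(0)
for review in range(1, 6):
    interval = next_interval(interval, correct=True)
    print(review, interval)
```

Because the gap multiplies each time, a handful of correct answers is enough to push a card out to weeks and then months.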

I ended up entering just on 400 questions and told Anki to give me 5 new cards every day plus whatever old ones I had to review. After a few months I had gone through all the questions and had a good feel for them. I also did some of the official practice exams.

Eventually in December 2014 I sat the exam and got 100 percent correct.

I’ll make my deck available at the link below. There are just over 400 cards in it, some with pictures. There are a few duplications but no errors as far as I am aware. They are current as of late 2014 (including the give-way rules change that year).

To use them you’ll need a copy of Anki and it is probably easiest to use the desktop edition to import the file and then use an Ankiweb account to Synchronize to a copy on your phone.

Download NZ Driver license Theory Anki Deck (2MB .apkg file)


Donations 2015

Up until a couple of years ago my main charity was a regular payment to Oxfam. However I cancelled this after I decided I disliked their fund-raising methods and otherwise read they were probably not in the top few percent of charities. Since then I’ve been tending to do things all in one go.

I just finished doing this year’s so I thought I’d document it here. It does feel a little weird to post about it but I’ve seen others do it. The theory I guess is that you the reader might be convinced that giving to charity is a good thing and do likewise.

My main donation was to the top four charities rated by GiveWell:

  • Against Malaria Foundation – $US 150
  • Schistosomiasis Control Initiative – $US 150
  • Deworm the World Initiative – $US 150
  • GiveDirectly – $US 150

Next were a series of Open Source projects

  • Debian – $US 50
  • Freedesktop.org – $US 30
  • LibreOffice – $US 30
  • OpenBSD – $US 30
  • Python – $US 30
  • Gnome – $US 30

Interestingly enough I hadn’t originally intended to donate to LibreOffice and Freedesktop.org but Debian handles donations via Software in the Public Interest and those two showed up on the same donation page.

and some others

I thought about a few others including The Internet Archive, Anki and Mozilla. Perhaps next year


OSCON 2015

No, I didn’t attend 🙁

But I had a look through the list of talks and read a tonne of slides. Here are some interesting ones I found, which I hope to watch when the videos go up.
