DevOps Days Auckland 2017 – Tuesday Session 2

Using Bots to Scale Incident Management – Anthony Angell (Xero)

  • Who we are
    • Single Team
    • Just a platform Operations team
  • SRE team is formed
    • Ops teams plus performance Engineering team
  • Incident Management
    • In the bad old days – 600 people on a single chat channel
    • Created Framework
    • What do incidents look like, post-mortems, best practices
    • How to make incident management easy for others?
  • ChatOps (Based on Hubot)
    • Automated tour guide
    • Multiple integrations – anything with Rest API
    • Reducing time to restore
    • Flexibility
  • Release register – API hook to when changes are made
  • Issue report form
    • Summary
    • URL
    • User-ids
    • how many users & location
    • when started
    • anyone working on it already
    • Anything else to add.
  • Chat Bot for incident
    • Populates the form and pushes it to the production channel, creates a PagerDuty alert
    • Creates new Slack channel for incident
    • Can automatically update status page from chat and page senior managers
    • Can create “status updates” which record things (eg “restarted server”), or “Yammer updates” which get pushed to social media team
    • Creates a task list automatically for the incident
    • Page people from within chat
    • At the end: Gives time incident lasted, archives channel
    • Post Mortem
  • More integrations
    • Report card
    • Change tracking
    • Incident / Alert portal
  • High Availability – dockerisation
  • Caching
    • PagerDuty
    • AWS
    • Datadog
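
The incident flow above (a report is raised, status updates are recorded, and the bot reports how long the incident lasted) can be sketched as a small data model. This is a minimal Python sketch, not Xero's bot: the class and field names are hypothetical, and the Slack, PagerDuty and status page integrations are left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Incident:
    """Hypothetical in-memory record of one incident."""
    summary: str
    started: datetime
    updates: List[Tuple[datetime, str]] = field(default_factory=list)
    closed: Optional[datetime] = None

    def status_update(self, when: datetime, note: str) -> None:
        # Record a timestamped note such as "restarted server".
        self.updates.append((when, note))

    def close(self, when: datetime) -> timedelta:
        # Mark the incident finished and return how long it lasted.
        self.closed = when
        return when - self.started


start = datetime(2017, 10, 3, 9, 0)
incident = Incident(summary="Login page returning errors", started=start)
incident.status_update(start + timedelta(minutes=10), "restarted server")
duration = incident.close(start + timedelta(minutes=45))
print(f"Incident lasted {duration}")
```

Keeping the updates as timestamped entries means the same record could later feed a post-mortem timeline.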


DevOps Days Auckland 2017 – Tuesday Session 1

DevSecOps – Anthony Rees

“When Anthrax and Public Enemy came together, It was like Developers and Operations coming together”

  • Everybody is trying to get things out fast, sometimes we forget about security
  • Structural efficiency and optimised flow
  • Compliance putting roadblock in flow of pipeline
    • Even worse scanning in production after deployment
  • Compliance guys using Excel, Security using Shell-scripts, Developers and Operations using Code
  • Chef security compliance language – InSpec
    • Insert Sales stuff here
  • inspec.io
  • Lots of pre-written configs available

Immutable SQL Server Clusters – John Bowker (from Xero)

  • Problem
    • Pet Based infrastructure
    • Not in cloud, weeks to deploy new server
    • Hard to update base infrastructure code
  • 110 Prod Servers (2 regions).
  • 1.9PB of Disk
  • Octopus Deploy: SQL Schemas, Also server configs
  • Half of team in NZ, Half in Denver
    • Data Engineers, Infrastructure Engineers, Team Lead, Product Owner
  • Where we were – The Burning Platform
    • Changed mid-Migration from dedicated instances to dedicated Hosts in AWS
    • Big saving on software licensing
  • Advantages
    • Already had Clustered HA
    • Existing automation
    • 6 day team, 15 hours/day due to multiple locations of team
  • Migration had to have no downtime
    • Went with node swaps in cluster
  • Split team. Half doing migration, half creating code/system for the node swaps
  • We learnt
    • Dedicated hosts are cheap
    • Dedicated host automation not so good for Windows
    • Discovery service not so good.
    • Syncing data took up to 24h due to large dataset
    • Powershell debugging is hard (moving away from powershell a bit, but powershell has lots of SQL server stuff built in)
    • AWS services can timeout, allow for this.
  • Things we Built
    • Lots of Step Templates in Octopus Deploy
    • Metadata Store for SQL servers – Dynamite (Python, Lambda, Flask, DynamoDB) – Hope to Open source
    • Lots of PowerShell Modules
  • Node Swaps going forward
    • Working towards making this completely automated
    • New AMI -> Node swap onto that
    • Avoid upgrade in place or running on old version
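
One common way to "allow for" AWS timeouts is to retry with exponential backoff. The sketch below is illustrative only: the team's tooling was PowerShell and Octopus Deploy, Python is used here purely for the example, `flaky_call` is a hypothetical stand-in for a real AWS call, and the attempt count and delays are assumed values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(call: Callable[[], T], attempts: int = 4,
                       base_delay: float = 0.01,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Run call(), retrying on TimeoutError with a doubling delay."""
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the timeout
            sleep(base_delay * (2 ** attempt))
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky_call() -> str:
    # Hypothetical stand-in: times out twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated AWS timeout")
    return "ok"


result = retry_with_backoff(flaky_call)
print(result, "after", calls["count"], "calls")
```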

Linux.conf.au 2017 – Friday – Closing

Code of Conduct and Safety

  • Badge
    • Putting preferred pronoun
    • Emoji
  • Free Childcare
    • Sponsored by GitHub
    • Approx 10 kids
  • Assistance Grants
  • Attendees
    • Breakdown by gender etc
    • Roughly 25% of attendees and speakers not men
  • More numbers
    • 104 Matrix chat users
    • 554 attendees
    • 2900 coffee cups
    • Network claimed to 7.5Gb/s
    • 1.6 TB over the week, 200Mb/s max
    • 30 Session Chairs
    • 12 Miniconfs
    • 491 Proposals (130 more than the others)
    • 6 Tutorials, 75 talks, 80 speakers
    • 4 Keynote speakers
    • 21 Sponsors

Linux.conf.au 2018 – Sydney

  • A little bit of history repeating
  • 2001, 2007, 2018
  • Venue is UTS
  • 5 minutes to food, train station
  • https://lca2018.org
  • @lca2018 on twitter
  • Looking for a few extra helpers

Raffle

  • In support of Outreachy
  • 3 interns funded

Final Bit

  • Thanks to team members


Linux.conf.au 2017 – Friday – Lightning Talks

Use #lcapapers to tell Linux.conf.au what you want to see in 2018

Michael Still and Michael Davies get the Rusty Wrench award

Karaoke – Jack Skinner

  • Talk with random slides

Martin Krafft

  • Matrix
  • End to end encrypted communication system
  • No entity owns your conversations
  • Bridge between walled gardens (eg IRC and Slack)
  • In Very late Beta, 450K user accounts
  • Run or Write your own servers or services or client

Cooked – Pete the Pirate

  • How to get into Sous Vide cooking
  • Create home kit
  • Beaglebone Black
  • Rice cooker, fish tank air pump.
  • Also use to germinate seeds
  • Also use this system to brew beer

Emoji Archeology 101 – Russell Keith-Magee

  • 1963 Happy face created
  • 🙂 invented
  • later 🙁 invented
  • Only those emotions imposed by the Unicode consortium can now be expressed

The NTPsec Project – Mark Atwood

  • Since 2014
  • Forked and moved to git in 2015 from parent ntp project
  • 1.0.0 release soon
  • Removed 73% of lines from classic
    • Removed commandline tools
    • Got rid of stuff for old OSes
    • Changed to POSIX and modern coding
    • removed experiments
  • Switch to git and bugzilla etc
  • Fun not painful
  • Welcoming community, not angry
  • ntpsec.org

National Computer Science Summer School – Katie Bell

  • Running for 22 years
  • Web stream, Embedded Stream
  • Using BBC Microbit
  • Lots of projects
  • Students in grade 10-11
  • Happens in January
  • Also 5 week long online programming competition NCSS Competition.

Blockchain – Rusty Russell

  • Blockchain
  • Blockchain
  • Blockchain

Go to Antarctica – Jucinter Richardson

  • Went Twice
  • Go by ship
  • No rain
  • Nice and cool
  • Join the government
  • Positions close
  • Go while it is still there

Cool and Awesome projects you should help with – Tim Ansell

  • Tomu Boards
  • MicroPython on FPGAs
  • Python Devicetree – needs a good library
  • QEMU for LiteX / MiSoC
  • NuttX for LiteX / MiSoC
  • QEMU for Tomu
  • Improving LiteX / MiSoC
  • Cypress FX2
  • Linux to LiteX / MiSoC
  • HDMI2USB
  • j.mp/timpro-lca2017

LoRa TAS – Paul Neumeyer

  • long range (2-3km urban 10km rural)
  • low power (battery ~5 years)
  • Unlicensed radio spectrum 915-928 MHz band (AUS)
  • LoRaWAN is an open standard
  • Ideal for IoT applications (sensing, preventative maintenance, smart)

Roan Kattatow

  • Different languages mix dots and commas and spaces etc to write numbers

ZeroSkip – Ron Gondwana

  • Crash safe embedded database
  • Not fast enough
  • Zeroskip
  • Append only database file
  • Switch files now and then
  • Repack old files together
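
The append-only pattern in these notes (never rewrite a record, start a new file now and then, repack the old ones) can be sketched in a few lines of Python. This is a toy illustration and not ZeroSkip's actual format or API: in-memory lists stand in for files, and every name here is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[str]]  # (key, value); value None marks a delete


class AppendOnlyStore:
    """Toy append-only key-value store; lists stand in for on-disk files."""

    def __init__(self, max_records_per_file: int = 3) -> None:
        self.max_records = max_records_per_file
        self.files: List[List[Record]] = [[]]  # oldest first, last is active

    def _append(self, record: Record) -> None:
        if len(self.files[-1]) >= self.max_records:
            self.files.append([])  # switch to a fresh file now and then
        self.files[-1].append(record)  # existing records are never rewritten

    def put(self, key: str, value: str) -> None:
        self._append((key, value))

    def delete(self, key: str) -> None:
        self._append((key, None))

    def get(self, key: str) -> Optional[str]:
        # Newest record wins, so scan from the end backwards.
        for log in reversed(self.files):
            for k, v in reversed(log):
                if k == key:
                    return v
        return None

    def repack(self) -> None:
        # Merge every file except the active one, keeping only live values.
        latest: Dict[str, Optional[str]] = {}
        for log in self.files[:-1]:
            for k, v in log:
                latest[k] = v
        packed = [(k, v) for k, v in latest.items() if v is not None]
        self.files = [packed, self.files[-1]]


store = AppendOnlyStore()
for i in range(5):
    store.put("counter", str(i))
store.put("name", "zeroskip")
store.delete("name")
store.repack()
print(store.get("counter"), store.get("name"), len(store.files))
```

Because writes only ever append, a crash can at worst leave a partial record at the tail of the active file, which is the property that makes this layout attractive for crash safety.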

PyCon Au – Richard Jones

  • Python Conference Australia
  • 7th in Melbourne in Aug 2016 – 650 people, 96 presentations
  • In Melbourne on 3-8 August 2017
  • 2017.pycon-au.org

Buying a Laptop built for Linux – Paul Wayper

  • Bought from System76
  • Designed for Linux

openQA – Aleksa Sarai

  • Life is too short for manual testing
  • Perl based framework that lets you emulate a user
  • Runs from console, emulates keyboard and mouse
  • Has screenshots
  • Used by SUSE, openSUSE and Fedora
  • Fuzzy comparison, using regular expressions
  • open.qa

South Coast Track – Bec, Clinton and Richard

  • What I did in the Holidays
  • 6 day walk in southern Tasmania
  • Lots of pretty photos

Linux.conf.au 2017 – Friday – Session 2

Continuously Delivering Security in the Cloud – Casey West

  • This is a talk about operational excellence
  • Why are systems attacked? Because they exist
  • Resisting Change to Mitigate Risk – It’s a trap!
  • You have a choice
    • Going fast with unbounded risk
    • Going slow to mitigate risk
  • Advanced Persistent Threat (APT) – The breach that lasts for months
  • Successful attacks have
    • Time
    • Leaked or misused credentials
    • Misconfigured or unpatched software
  • Changing very little, slowly, helps attackers with all three of the above
  • A moving target is harder to hit
  • Cloud-native operability lets platforms move faster
    • Composable architecture (serverless, microservices)
    • Automated Processes (CD)
    • Collaborative Culture (DevOps)
    • Production Environment (Structured Platform)
  • The 3 Rs
    • Rotate
      • Rotate credentials every few minutes or hours
      • Credentials will leak, Humans are weak
      • “If a human being generates a password for you then you should reject it”
      • Computers should generate it, every few hours
    • Repave
      • Repave every server and application every few minutes/hours
      • Implies you have things like LBs that can handle servers adding and leaving
      • Container lifecycle
        • Built
        • Deploy
        • Run
        • Stop
      • Note: No “change” step
      • A Server that doesn’t exist isn’t being compromised
      • Regularly blow away running containers
      • Repave ≠ Patch
      • uptime <= 3600
    • Repair
      • Repair vulnerable runtime environments every few minutes or hours
      • What stuff will need repair?
        • Applications
        • Runtime Environments (eg rails)
        • Servers
        • Operating Systems
      • The Future of security is build pipelines
      • Try to put in credential rotation and upstream imports into your builds
  • Embracing Change to Mitigate Risk
  • Less of a Trap (in the cloud)
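
Two of the three Rs lend themselves to a short sketch: flagging servers whose uptime exceeds the 3600 second limit (Repave) and having a machine generate credentials (Rotate). This is a minimal Python sketch under assumptions: the `Server` type and the fleet data are hypothetical, and a real platform would destroy and rebuild the flagged instances rather than just list them.

```python
import secrets
from dataclasses import dataclass
from typing import List

MAX_UPTIME_SECONDS = 3600  # from the talk: uptime <= 3600


@dataclass
class Server:
    name: str
    uptime_seconds: int


def needs_repave(servers: List[Server]) -> List[str]:
    """Return names of servers that have been up longer than the limit."""
    return [s.name for s in servers if s.uptime_seconds > MAX_UPTIME_SECONDS]


def new_credential() -> str:
    """Machine-generated secret; no human chooses it."""
    return secrets.token_urlsafe(32)


fleet = [Server("web-1", 120), Server("web-2", 4000), Server("db-1", 3600)]
stale = needs_repave(fleet)
credential = new_credential()
print(stale)
```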

Linux.conf.au 2017 – Friday – Session 1

Adventures in laptop battery hacking – Matthew Chapman

  • Lenovo Thinkpad X230T
    • Bought Aug 2013
    • Original capacity 62 Wh – 5 hours at 12W
    • Capacity down to 1.9Wh – 10 minutes
  • 45N1079 replacement bought
    • DRM on laptop claimed it was not genuine and refused to recharge it.
  • Batteries talk SBS protocol to laptop
  • SMBus port and SMClock port
    • sniffed the port with logic analyser
    • Using I2C protocol
    • Looked at spec to see what it means
    • Challenge-response authentication
  • Options
    1. Throw Away
    2. Replace Cells
      • Easy to damage
      • Might not work
    3. Hack firmware on battery
      • Talk at DEFCON 19
      • But this is different model from that
      • Couldn’t work out how to get to firmware
    4. Added something in between
    5. Update the firmware on the machine
      • Embedded Controller (EC)
      • MEC1619
  • Looking through the firmware for Battery Authentication
    • Found routines that look plausible
    • But other stuff was encrypted
  • EC Update process
    • BIOS update puts EC update in spare flash memory area
    • Afterwards the BIOS grabs that and applies the update
  • Pulled apart the BIOS, found EcFwUpdateDxe.efi routine that updates the EC
    • Found that stuff sent to the EC is still encrypted.
    • Decryption done by flasher program
  • Flasher program
    • Encrypted itself (decrypted by the current firmware)
    • JTAG interface for flashing debug
  • JTAG
    • Physically difficult to get to
    • Luckily Russian Hackers have already grabbed a copy
  • The Decryption function in the Flasher program
    • Appears to be Blowfish
    • Found the key (in expanded form) in the firmware
    • Enough for the encryption and decryption
  • Checksums
    • Outer checksum checked by BIOS
    • Post-decryption sum – checked by the flasher (bricks EC if bad)
    • Section checksums (also bricks)
  • Applying
    • noop the checks in code
    • noop another check that sometimes failed
    • Different error message
  • Found a second authentication process
    • noop out the 2nd challenge in the BIOS
  • Works!
  • Posted writeup, posted to Hacker News
    • 1 million page views
  • Uploaded code to GitHub
    • Other people doing stuff with the embedded controller
    • No longer works on latest laptops, EC firmware appears to be signed
  • Anything can be broken with physical access and significant determination
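
For readers unfamiliar with challenge-response authentication, the general idea is sketched below in Python. The notes do not say which algorithm the battery and laptop actually use, so HMAC-SHA256 over a shared secret is an assumed stand-in, and the secret and function names are hypothetical.

```python
import hashlib
import hmac
import secrets

SHARED_SECRET = b"hypothetical-secret"  # assumption: both sides hold the same key


def make_challenge() -> bytes:
    # The host (laptop) sends a fresh random value each time.
    return secrets.token_bytes(16)


def respond(secret: bytes, challenge: bytes) -> bytes:
    # The device (battery) proves it knows the secret without revealing it.
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(secret: bytes, challenge: bytes, response: bytes) -> bool:
    expected = respond(secret, challenge)
    return hmac.compare_digest(expected, response)


challenge = make_challenge()
genuine = verify(SHARED_SECRET, challenge, respond(SHARED_SECRET, challenge))
clone = verify(SHARED_SECRET, challenge, respond(b"wrong-key", challenge))
print(genuine, clone)
```

A replacement battery without the secret cannot answer a fresh challenge, which is why the talk's approach was to patch the check out of the EC firmware instead of imitating a genuine battery.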

Election Software – Vanessa Teague

  • Australian Elections use a lot of software
    • Encoding and counting preferential votes
    • For voting in polling places
    • For voting over the internet
  • How do we know this software is correct
  • The Paper ballot box is engineered around a series of problems
    • In the past people brought their own voting paper
    • The Australian Ballot used in many places (eg NZ)
    • French use different method with envelopes and glass boxes
    • The US has had lots of problems and different ways
  • Four case studies in Aus
  • vVote: Victoria
    • Vic state election 2014
    • 1121 votes for overseas Australians voting in Embassies etc
    • Based on Pret a Voter
    • You can verify that what you voted was what went through
    • Source code on bitbucket
    • Crypto signed, verified, open source, etc
    • Not going forward
    • Didn’t get the electoral commissions input and buy-in.
    • A little hard to use
  • iVote: NSW and WA
    • 280,000 votes over Internet in 2015 NSW state election ( around 5-6% of total votes)
    • Vote on a device of your choosing
    • Vote encrypted and send over Internet
    • Get receipt number
    • Exports to a verification service. You can telephone them, give them your number and they will read back your votes
    • Website used 3rd-party analytics provider with export-grade crypto
      • Vulnerable to injection of content, votes could be read or changed
      • Fixed (after 66k votes cast)
    • NSW iVote really wasn’t verifiable
    • About 5000 people called into service and successfully verified
    • How many tried to verify but failed?
    • Commission said 1.7% of electors verified and none identified any anomalies with their vote (Mar 2015)
    • How many tried and failed? “in the 10s” (Oct 2015)
    • Parliamentary inquiry: how many failed? Seven or 5 (Aug 2016)
    • How many failed to get any vote? 627 (Aug 2016)
    • This is a failure rate of about 10%
    • It is believed it was around 200 unique (later in 2016)
  • Vote Counting software
  • Errors in NSW counting
    • In NSW legislative voting, redistributed votes are selected at random
    • No source code for this
    • Use same source code for lots of other elections
    • Re-ran some of the votes, found randomness could change results. Found one most likely cost somebody a seat, but not till 4 years later.
  • Recommended
    • Generate the random key publicly
    • Open up the source code
    • The electoral people didn’t want to do this.
  • In the 2016 local govt count we found 2 more bugs
    • One candidate should have won with 54% probability but didn’t
  • The Australian Senate Count
  • AEC consistently refuses to reveal the source code
  • The Senate data is released; you can redo the count yourself and any bugs will become evident
  • What about digitising the ballots?
    • How would we know if that wasn’t working?
    • Only by auditing the paper evidence
  • Auditing
    • The Americans have a history of auditing the paper ballots
    • But the Australian vote is a lot more complex so everything not 100% yet
    • Stuff is online
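
The point about randomly selected redistributed votes can be illustrated with a small simulation. This is a deliberately simplified Python sketch, not the NSW counting algorithm: the ballot numbers are invented, and it only shows that sampling which ballots to transfer can favour different candidates depending on the random seed.

```python
import random
from collections import Counter
from typing import List


def transfer_surplus(next_prefs: List[str], surplus: int, seed: int) -> Counter:
    """Randomly pick `surplus` ballots and count their next preferences."""
    rng = random.Random(seed)
    return Counter(rng.sample(next_prefs, surplus))


# Hypothetical elected candidate's 1000 ballots: next preference B or C.
ballots = ["B"] * 510 + ["C"] * 490
outcomes = [transfer_surplus(ballots, surplus=100, seed=s) for s in range(200)]
b_ahead = sum(1 for o in outcomes if o["B"] > o["C"])
c_ahead = sum(1 for o in outcomes if o["C"] > o["B"])
print(f"B ahead in {b_ahead} runs, C ahead in {c_ahead} runs")
```

Fixing the seed makes each run reproducible, which is in the spirit of the recommendation to generate the random key publicly so that anyone can re-run the count.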


Linux.conf.au 2017 – Friday Keynote – Robert Lefkowitz

Keeping Linux Great

  • Previous keynotes have posed questions, I’ll pose answers
  • What is the future of open source software? It has no future
  • FLOSS is yesterday’s gravy
    • Based on where the technology is today. How would FLOSS work with punch cards?
    • Other people have said similar things
    • Software, Linux and similar all going down in google trends
    • But “app” is going up
  • Lithification
    • Small pieces loosely joined
    • Linux used to be great because you could pipe stuff to little programs
    • That is what is happening to software
    • Example – share a page to another app in a mobile interface
    • All apps no longer need to send mail, they just have to talk to the mail app
  • So What should you do?
    • Vendor all your dependencies, just copy everyone else’s code into your repo (and list their names if it is BSD) so you can ship everything in one blob (eg Android)
      • Components must be >5 million or >20 LOC, only a handful of them
      • At the other end apps are smaller since they can depend on the OS or other Apps for lots of functionality so they don’t have to write it themselves.
      • Example node with thousands of dependencies
  • App Freedom
    • “Advanced programming environments conflate the runtime with the devtime” – Bret Victor
    • Open Source software rarely does that
    • “It turns out that Object Orientation didn’t work out, it is another legacy we are stuck with”
    • Having the source code is nice but it is not a requirement. Access to the runtime is what you want. You need to get it where people are using it.
  • Liberal Software
  • But not everyone wants to be a programmer
    • 75% comes from 6 generic web applications ( collection, storage, reservation, etc)
  • A lot of functionality requires big data or huge amounts of machines or is centralised so open sourcing the software doesn’t do anything useful
  • If it was useful it could be patented, if it was not useful but literary then it was just copyright

Linux.conf.au 2017 – Thursday – Session 3

Open Source Accelerating Innovation – Allison Randal

  • Story of Stallman and the printer
  • People don’t talk about the context of the story
    • Stallman was living in a free software domain, proprietary software was creeping in
    • Software only became subject to copyright in early 80s
  • First age of software – 1940s – 1960s
    • Software was low value
    • Software was all free and open, given away
  • Precursor – The 1970s
  • Middle Age of Software – 1980s
    • Start of Windows, Mac, Oracle and other big software companies
    • Also start of GNU and BSD
    • Who Leads?
      • Proprietary software was seen as the innovator and always would be.
      • Free Software was seen to be always chasing after Windows
  • The 2000s
    • Free Software caught up with Proprietary
    • Used by big companies
    • “Open Source” name adopted
    • dot-com bubble had burst
    • Web 2.0
    • Economic necessity, everyone else getting it for free
    • Collaborative Process – no silver bullet but a better chance
    • Innovations led by open source
  • Software Freedoms
    • About Control over our material environment
    • If you don’t have other freedoms then you don’t have a free society
  • Modern Age of Software
    • Accelerating
    • Companies: in 2010, 42% used OS software; in 2015, 78%
    • Using Open Source is now just table stakes
    • Competitive edge for companies is participating in OS
    • More participation pushes innovation even faster
  • Now What?
    • The New innovative companies
      • Amazing experiences
      • Augment Workers
      • Deliver cool stuff to customers
      • Use Network effects, Brand names
    • Businesses making contribution to society
    • Need to look at software that just doesn’t cover commercial use cases.
  • Next Phase
    • Diversity
    • Myopic monocultures – risk cause they miss the dangers
    • empowered to change the rule for the better

Surviving the Next 30 Years of Free Software – Karen M. Sandler

  • We’re not getting any younger
  • Software Relicensing
    • Need to get approval of authors to re-license
    • Has had to contact surviving spouse and get them to agree to re-license the code
    • One survivor wanted payment. Didn’t understand that code would be written out of the project.
  • There are surely other issues that we have not considered
  • Copyright Assignment is a way around it
    • But not everybody likes that.
  • Bequeathment doesn’t work
    • In some jurisdictions copyrights have to be assessed for their value before being transferred. Taxes could be owed
  • Who is your next of Kin?
    • They might not share your OS values or even think of them
  • Need perpetual care of copyrights
    • Debian Copyright Aggregation Projects
  • A Trust
    • Assign copyrights today, will give you back the rights you want but these expire on your death
    • Would be a registry for free software
    • Companies could participate too
  • Recognize the opportunity with age
    • A lot of people with a lot of spare time


Linux.conf.au 2017 – Thursday – Session 2

Content as a driver of change: then and now – Lana Brindley

  • Humans have always told stories
  • Cave Drawings
    • Australian Indigenous art is the oldest continuous art in the world
    • Stories of extinct mega-fauna
    • Stories of morals but sometimes also funny
  • Early Written Manuals
    • We remember the Eureka
  • Religious Leaders
    • Gutenberg
    • Bible was only redistributed book, restricted to clergy
  • Fairy Tales
    • Charles Perrault versions.
    • Brothers Grimm
    • Cautionary tales for adults
    • Very gruesome in the originals and many versions
    • Easiest and entertaining way for illiterate people to share moral stories
  • Master and Apprentice
    • Cheap Labour and Learn a Trade
  • Journals and Letters
    • In the early 19th century letter writing started happening
    • Recipe Books

  • Recently
  • Paper Manuals
    • Traditionally the proper method for technical docs
  • Whitepapers
    • Printed version will probably go away
    • Digital form may live on
  • Training Courses
    • Face to face training has its benefits
    • Online is where technical stuff is moving
  • Online Books
    • Online version of a printed book
    • Designed to be read from beginning to end, TOC, glossary, etc


  • Today
  • MOOCS
    • Quite common
  • Data Typing (DITA)
    • Break down the content into logical pieces
    • Store in a database
    • Mix on the fly
    • Doing this sort of thing since the 1960s and 1970s
  • Single Sourcing
    • Walked away from old idea of telling a story
    • Look at how people consumed and learnt difficult concepts
    • Deliver the same content many ways (beginner user, advanced, reference)
    • Chunks of information we can deliver however we like
  • User-Side Content Curation
    • Organised like a wikipedia article
    • Imagine a site listing lots of cars for sale, the filters curate the content
  • What comes next?
    • Large datasets and let people filter
    • Power going from producers to consumers
    • Consumers want to filter themselves, not leave the producers to do this
  • References and further reading for talk

I am your user. Why do you hate me? – Donna Benjamin

  • Free and open source software suffers from poor usability
  • We’ve struggled with open source software, heard devs talk about users with contempt
  • We define users by what they can’t do
  • How do I hate thee, let me count the ways
    • Why were we being made to feel stupid when we used free software
    • Software is “made by me for me”, just for brainiac me
    • Lots of stories about stupid users. Should we be calling our users stupid?
    • We often talk/draw about users as faceless icons
    • Take pride in having prickly attitudes
  • Users
    • Whiney, entitled and demanding
    • We wouldn’t want some of them as friends
    • Not talking about those sorts of users
  • Let’s chat about chat
    • Slack – used by OS projects, not the freest, proprietary
    • Better in many ways, less friction
  • Steep Learning curves
    • How long to get to the level of (a) Stop hating it? (b) Are Kicking ass
    • How do we get people over that level as quickly as possible
    • They don’t want to be badass at using your tool. They want you to be badass at what using your tool allows them to do
    • Badass: Making Users Awesome – Kathy Sierra
  • Perfect is the enemy of the good
  • Understand who your users are; see them as people like your friends and colleagues; not faceless icons


Linux.conf.au 2017 – Thursday – Session 1

The Vulkan Graphics API, what it means for Linux – David Airlie

  • What is Vulkan
    • Not OpenGL++
    • From Scratch, Low Level, Open Graphics API
    • Stack
      • Loader (Mostly just picks the driver)
      • Layers (sometimes optional) – Separate from the drivers.
        • Validation
        • Application Bug fixing
        • Tracing
        • Default GPU selection
      • Drivers (ICDs)
    • Open Source test Suite. ( “throw it over the wall Open Source”)
  • Why a new 3D API
    • OpenGL is old, from 1992
    • OpenGL Design based on 1992 hardware model
    • State machine has grown a lot as hardware has changed
    • Lots of stuff in it that nobody uses anymore
    • Some ideas were not so good in retrospect
      • Single context makes multi-threading hard
      • Sharing context is not reliable
      • Orientated around windows, off-screen rendering is a bolt-on
      • GPU hardware has converged to just 3-5 vendors with similar hardware. Not as much need to hide things
    • Vulkan moves a lot of stuff up to the application (or more likely the OS graphics layer like Unity)
    • Vulkan gives applications access to the queues if they want them.
    • Shading Language – SPIR-V
      • Binary formatted, separate from Vulkan, also used by OpenGL
      • Write shaders in HLSL or GLSL and they get converted to SPIR-V
    • Driver Development
      • Almost no error checking needed since it is done in the validation layer
      • Simpler to explicitly build command stream and then submit
    • Linux Support
      • Closed source Drivers
        • Nvidia
        • AMD (amdgpu-pro) – promised open source “real soon now … a year ago”
      • Open Source
        • Intel Linux (anv) –
          • on release day. 3.5 people over 8 months
          • SPIR -> NIR
          • Vulkan X11/Wayland WSI
          • anv Vulkan <– Core driver, not sharable
          • NIR -> i965 gen
          • ISL Library (image layout/tiling)
        • radv (for AMD GPUs)
          • Dave has been working on it since early July 2016 with one other guy
          • End of September Doom worked.
          • One Benchmark faster than AMD Driver
          • Valve hired someone to work on the driver.
          • Similar model to Intel anv driver.
          • Works on the few Vulkan games, working on SteamVR


Building reliable Ceph clusters – Lars Marowsky-Brée

  • Ceph
    • Storage Project
    • Multiple front ends (S3, Swift, Block IO, iSCSI, CephFS)
    • Built on RADOS data store
    • Software Defined Storage
      • Commodity servers + ceph + OS + Mngt (eg Open Attic)
      • Makes sense at 4+ servers with 10 drives each
      • metadata service
      • CRUSH algorithm to spread out the data, no centralised table (client goes directly to data)
    • Access Methods
      • Use only what you need
      • RADOS Block devices   <– most stable
      • S3 (or Swift) via RadosGW  <– Mature
      • CephFS  <– New and pretty stable, avoid metadata-intensive stuff
    • Introducing Dependability
      • Availability
      • Reliability
        • Durability
      • Safety
      • Maintainability
    • Most outages are caused by Humans
    • At Scale everything fails
      • The Distributed systems are still vulnerable to correlated failures (eg same batch of hard drives)
      • Advantages of Heterogeneity – Everything is broken differently
      • Homogeneity is non-sustainable
    • Failure is inevitable; suffering is optional
      • Prepare for downtime
      • Test if system meets your SLA when under load and when degraded and during recovery
    • How much availability do you need?
      • An extra nine will double your price
  • A Bag full of suggestions
    • Embrace diversity
      • Auto recovery requires a >50% majority
      • 3 suppliers?
      • Mix arch and stuff between racks/pods and geography
      • Maybe you just go with manually added recovery
    • Hardware Choices
      • Vendors have reference architectures
      • Hard to get vendors to mix, they don’t like that and fewer docs.
      • Hardware certification reduces the risk
      • Small variations can have huge impact
        • Customer bought network card and switch one up from the ref architecture. 6 months of problems till firmware bug fixed.
    • How many monitors do I need?
      • Not performance critical
      • 3 is usually enough as long as well distributed
      • Big envs maybe 5 or 7
      • Don’t converge (VMs) these with other types of nodes
    • Storage
      • Avoid Desktop Disks and SSDs
    • Storage Node sizing
      • A single node should not be more than 10% of your capacity
      • You need spare capacity at least as big as a single node (to recover after fail)
    • Durability
      • Erasure coding is more durable and a higher percentage of disk is usable
      • But recovery a lot slower, high overhead, etc
      • Different strokes for different pools
    • Network cards, different types, cross connect, use last year’s cards
    • Gateways: tests okay under failure
    • Config drift: Use config mngt (puppet etc)
    • Monitoring
      • Perf as system ages
      • SSD degradation
    • Updates
      • Latest software is always the best
      • Usually good to update
      • Can do rolling upgrades
      • But still test a little on a staging server first
      • Always test on your system
        • Don’t trust metrics from vendors
        • Test updates
        • test your processes
        • Use OS to avoid vendor lock in
    • Disaster will strike
      • Have backups and test them and recoveries
    • Avoid Complexity
      • Be aggressive in what you test
      • Be conservative in what you deploy, only what you need
    • Q: Minimum size?
    • A: Not if you can fit on a single server
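
The rules of thumb above (a strict majority of monitors for automatic recovery, no node above 10% of capacity, spare capacity at least one node in size) are easy to check mechanically. The following is a minimal Python sketch with hypothetical function names and made-up cluster sizes; it does not call any Ceph API.

```python
from typing import List


def has_quorum(monitors_up: int, monitors_total: int) -> bool:
    """Automatic recovery needs a strict majority (>50%) of monitors."""
    return monitors_up * 2 > monitors_total


def sizing_problems(node_capacities_tb: List[float], used_tb: float) -> List[str]:
    """Check the two storage-node rules of thumb from the talk."""
    total = sum(node_capacities_tb)
    largest = max(node_capacities_tb)
    problems = []
    if largest > 0.10 * total:
        problems.append("a single node holds more than 10% of capacity")
    if total - used_tb < largest:
        problems.append("spare capacity is smaller than the largest node")
    return problems


print(has_quorum(2, 3), has_quorum(2, 4))
small = sizing_problems([100.0] * 4, used_tb=350.0)
large = sizing_problems([100.0] * 12, used_tb=900.0)
print(small, large)
```

The quorum check also shows why monitor counts are odd: four monitors tolerate no more failures than three, since two out of four is not a majority.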
