Keynote: Containers aka crazy user space fun
- Speaker works at Microsoft on open source and containers, specifically Kubernetes
- Containers vs Zones vs Jails vs VMs
- Containers are not a first class concept in the kernel.
- Namespaces
- Cgroups
- AppArmor in LSM (prevents mounting, writing to /proc etc.) (or SELinux)
- Seccomp (syscall filters that allow or deny each call): prevents 150 other syscalls which are uncommon or dangerous.
- The list came from testing against all of Docker Hub
- e.g. clone, unshare
- NoNewPrivs (exposed as “allowPrivilegeEscalation” in K8s)
- rkt and systemd-nspawn don’t follow this model 100%
- Intel Clear containers are really VMs
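To make the seccomp bullet concrete, here is a minimal sketch in Go of how a filter decides: a default action plus per-syscall overrides. The `Filter` type and the `exampleFilter` allowlist are hypothetical, made up for illustration; this is not Docker's real profile and it installs nothing in the kernel.

```go
package main

import "fmt"

// Action is a simplified outcome: the syscall is allowed or denied.
// Real seccomp has more actions (kill, trap, log and so on).
type Action string

const (
	Allow Action = "allow"
	Deny  Action = "deny"
)

// Filter is an illustrative model of a seccomp profile: a default action
// plus per-syscall overrides. It only models the decision.
type Filter struct {
	Default   Action
	Overrides map[string]Action
}

// Decide returns the action the filter would take for the named syscall.
func (f Filter) Decide(syscall string) Action {
	if a, ok := f.Overrides[syscall]; ok {
		return a
	}
	return f.Default
}

// exampleFilter is a hypothetical allowlist-style profile (assumption,
// not Docker's actual list): deny by default, allow a few calls.
func exampleFilter() Filter {
	return Filter{
		Default: Deny,
		Overrides: map[string]Action{
			"read":       Allow,
			"write":      Allow,
			"exit_group": Allow,
		},
	}
}

func main() {
	f := exampleFilter()
	for _, name := range []string{"read", "write", "clone", "unshare"} {
		fmt.Printf("%-8s -> %s\n", name, f.Decide(name))
	}
}
```

With a deny-by-default filter, anything not explicitly listed (here clone and unshare) falls through to the default action.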
History of Containers
- OpenVZ (2005)
- Linux-Vserver (2008)
- LXC (2008)
- Docker (2013)
- Initially used LXC as a backend
- Switched to libcontainer in v0.9
- lmctfy (2013)
- By Google
- rkt (2014)
- runc (2015)
- Part of the Open Container Initiative
- Container runtimes are like the new JavaScript frameworks
Are Containers Secure?
- Yes
- and I can prove it
- VMs, Zones and Jails are like Lego with all the pieces already glued together
- With containers you get the parts separately
- You can turn on and off certain namespaces
- You can share namespaces between containers
- Every container in a k8s pod shares PID and NET namespaces
- Docker has sane defaults
- You can sandbox apps even further though
- https://contained.af/
- No one has managed to break out of the container
- Has a very strict seccomp profile applied
- You’d be better off attacking the app, but you would still be running under the container’s default seccomp filters
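The "separate Lego pieces" point can be sketched as composing namespace flags: each namespace is one bit that is either set or left out when a process is created. The flag values below are restated from the Linux header linux/sched.h so the sketch compiles anywhere; the `cloneFlags` helper and the short names ("pid", "net" and so on) are hypothetical.

```go
package main

import "fmt"

// Namespace flag values as defined in the Linux UAPI header <linux/sched.h>,
// restated here so the sketch does not depend on a Linux-only package.
const (
	cloneNewNS   = 0x00020000 // mount namespace
	cloneNewUTS  = 0x04000000 // hostname namespace
	cloneNewIPC  = 0x08000000 // IPC namespace
	cloneNewUser = 0x10000000 // user namespace
	cloneNewPID  = 0x20000000 // PID namespace
	cloneNewNet  = 0x40000000 // network namespace
)

// nsFlags maps hypothetical short names to the flag bits above.
var nsFlags = map[string]uintptr{
	"mnt":  cloneNewNS,
	"uts":  cloneNewUTS,
	"ipc":  cloneNewIPC,
	"user": cloneNewUser,
	"pid":  cloneNewPID,
	"net":  cloneNewNet,
}

// cloneFlags ORs together the bits for the namespaces that should be new
// (unshared). Any namespace left out stays shared with the parent.
func cloneFlags(enabled ...string) (uintptr, error) {
	var flags uintptr
	for _, name := range enabled {
		bit, ok := nsFlags[name]
		if !ok {
			return 0, fmt.Errorf("unknown namespace %q", name)
		}
		flags |= bit
	}
	return flags, nil
}

func main() {
	// Fully isolated: every namespace is new.
	all, _ := cloneFlags("mnt", "uts", "ipc", "user", "pid", "net")
	// Leave "net" and "pid" out to keep sharing them with the parent.
	shared, _ := cloneFlags("mnt", "uts", "ipc", "user")
	fmt.Printf("isolated: %#x\nsharing net+pid: %#x\n", all, shared)
}
```

On Linux a value like this is what would go into the `Cloneflags` field of `syscall.SysProcAttr`; that step is left out here because it needs privileges.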
Containerizing the Desktop
- Switched to runc from docker (had to convert stuff)
- rootless containers
- Runc hook “netns” to do networking
- Sandboxed desktop apps, running in containers
- Switched from Debian to CoreOS Container Linux as the base OS
- Verify the integrity of the OS
- Just had to add graphics drivers
- Based on Gentoo, emerge all the way down
What if we applied the same defaults to programming languages?
- Generate seccomp filters at build-time
- Previously tried at run time, which doesn’t work that well; something is always missed
- At build time we can ensure all code is included in the filter
- The Go compiler writes the assembly for all the syscalls; you can hook into that, grab the list of them and create a seccomp filter
- Not quite that simple
- plugins
- exec of external stuff
- Go code can directly invoke a syscall whose name is passed in via arguments at runtime
- metaparticle.io
- Library for cloud-native applications
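A rough sketch of the build-time idea: take the syscall names found during the build, deduplicate them, and emit an allowlist profile. The input list is made up, `buildProfile` is a hypothetical helper, and the JSON layout (defaultAction, syscalls, names, action) follows the profile format Docker accepts for custom seccomp profiles as far as I understand it. How the names are actually extracted from the compiler is not shown.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// syscallRule and profile mirror a subset of a Docker-style seccomp
// profile. Only the fields needed for a simple allowlist are modelled.
type syscallRule struct {
	Names  []string `json:"names"`
	Action string   `json:"action"`
}

type profile struct {
	DefaultAction string        `json:"defaultAction"`
	Syscalls      []syscallRule `json:"syscalls"`
}

// buildProfile deduplicates and sorts the syscall names collected at build
// time and returns a profile that allows only those calls.
func buildProfile(found []string) profile {
	seen := make(map[string]bool)
	names := []string{}
	for _, n := range found {
		if !seen[n] {
			seen[n] = true
			names = append(names, n)
		}
	}
	sort.Strings(names)
	return profile{
		DefaultAction: "SCMP_ACT_ERRNO", // everything not listed fails
		Syscalls: []syscallRule{
			{Names: names, Action: "SCMP_ACT_ALLOW"},
		},
	}
}

func main() {
	// Hypothetical build output: syscalls seen in the compiled binary.
	found := []string{"write", "read", "exit_group", "read"}
	out, err := json.MarshalIndent(buildProfile(found), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The caveats in the list above still apply: plugins, exec'd binaries and syscalls chosen at runtime would not show up in a list gathered this way.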
Linux Containers in secure enclaves (SCONE)
- Currently Slow
- Lots of tradeoffs over what executes where (trusted area or untrusted area)
Soft multi-tenancy
- Reduced threat model, users not actively malicious
- Hard Multi-tenancy would have potentially malicious containers running next to others
- Host OS – e.g. CoreOS
- Container Runtime – Look at glasshouse VMs
- Network – Lots to do, default deny in k8s is a good start
- DNS – Needs to be namespaced properly or turned off. option: kube-dns as a sidecar
- Authentication and Authorisation – RBAC
- Isolation of master and System nodes from nodes running containers
- Restricting access to host resources (k8s hostpath for volumes, pod security policy)
- Making sure everything else is “very dumb” about its surroundings
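As an illustration of "default deny in k8s", the sketch below builds a NetworkPolicy that selects every pod in a namespace and lists no allowed traffic, then prints it as JSON. The policy name "default-deny" and the namespace "tenant-a" are assumptions, and the structs model only the handful of fields used here rather than the full Kubernetes API types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Minimal stand-ins for the NetworkPolicy fields used in this sketch.
type metadata struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace"`
}

type spec struct {
	// An empty selector ({}) matches every pod in the namespace.
	PodSelector map[string]string `json:"podSelector"`
	// Listing a policy type with no rules denies that direction.
	PolicyTypes []string `json:"policyTypes"`
}

type networkPolicy struct {
	APIVersion string   `json:"apiVersion"`
	Kind       string   `json:"kind"`
	Metadata   metadata `json:"metadata"`
	Spec       spec     `json:"spec"`
}

// defaultDeny returns a policy that denies all ingress and egress for
// every pod in the given namespace. The name is a hypothetical choice.
func defaultDeny(namespace string) networkPolicy {
	return networkPolicy{
		APIVersion: "networking.k8s.io/v1",
		Kind:       "NetworkPolicy",
		Metadata:   metadata{Name: "default-deny", Namespace: namespace},
		Spec: spec{
			PodSelector: map[string]string{},
			PolicyTypes: []string{"Ingress", "Egress"},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(defaultDeny("tenant-a"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Denying egress also cuts off DNS lookups until a rule allows them, which ties in with the DNS item above. The policy only has an effect if the cluster's network plugin enforces NetworkPolicy.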