Keynote: Containers aka crazy user space fun
- Work at Microsoft on open source and containers, specifically on Kubernetes
 - Containers vs Zones vs Jails vs VMs
 - Containers are not a first-class concept in the kernel; they are assembled from several separate kernel features:
- Namespaces
 - Cgroups
 - AppArmor (or SELinux) via the LSM framework (prevents mounting, writing to /proc, etc.)
 - Seccomp (syscall filters deciding which syscalls are allowed or denied); prevents 150 other syscalls that are uncommon or dangerous
- The list came from testing all of Docker Hub
 - e.g. clone, unshare
 - NoNewPrivs (exposed as “allowPrivilegeEscalation” in K8s)
 - rkt and systemd-nspawn don’t fully follow these defaults
 
 
 - Intel Clear Containers are really VMs
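
To make the seccomp point concrete, here is a minimal Go sketch that writes an allowlist profile in the JSON layout Docker accepts for custom profiles ("defaultAction", plus "syscalls" entries with "names" and "action"). Everything not on the list fails with an errno. The three syscalls are an illustrative assumption, not Docker's default profile, which allows far more.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// SyscallRule applies one action to a group of syscall names.
type SyscallRule struct {
	Names  []string `json:"names"`
	Action string   `json:"action"`
}

// Profile mirrors the top-level shape of a Docker seccomp profile.
type Profile struct {
	DefaultAction string        `json:"defaultAction"`
	Syscalls      []SyscallRule `json:"syscalls"`
}

// allowlistProfile makes every syscall fail with an errno by default
// and allows only the names passed in.
func allowlistProfile(allowed []string) Profile {
	names := append([]string(nil), allowed...) // copy so the caller's slice is not reordered
	sort.Strings(names)
	return Profile{
		DefaultAction: "SCMP_ACT_ERRNO",
		Syscalls:      []SyscallRule{{Names: names, Action: "SCMP_ACT_ALLOW"}},
	}
}

func main() {
	// Illustrative list only: a real program needs many more syscalls.
	p := allowlistProfile([]string{"write", "read", "exit_group"})
	out, err := json.MarshalIndent(p, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Saved to a file, the output could be passed with docker run --security-opt seccomp=profile.json; a profile this small would stop most real programs from even starting.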
 
History of Containers
- OpenVZ – released 2005
 - Linux-VServer (2001)
 - LXC (2008)
 - Docker (2013)
- Initially used LXC as a backend
 - Switched to libcontainer in v0.9
 
 - lmctfy (2013)
- By Google
 
 - rkt (2014)
 - runc (2015)
- Part of the Open Container Initiative
 
 - Container runtimes are like the new JavaScript frameworks
 
Are Containers Secure?
- Yes
 - and I can prove it
 - VMs, Zones and Jails are like Lego sets with all the pieces already glued together
 - With containers you have the parts separate
- You can turn on and off certain namespaces
 - You can share namespaces between containers
 - Every container in a k8s pod shares the pod's NET namespace, and the PID namespace can be shared too
 - Docker has sane defaults
 - You can sandbox apps even further though
 
 - https://contained.af/
- No one has managed to break out of the container
 - Has a very strict seccomp profile applied
 - You’d be better off attacking the app, but you would still be running under the container’s default seccomp filters
 
 
Containerizing the Desktop
- Switched from Docker to runc (had to convert things over)
 - rootless containers
 - Uses a runc hook, “netns”, to set up networking
 - Sandboxed desktop apps, running in containers
 - Switched from Debian to CoreOS Container Linux as the base OS
- Verify the integrity of the OS
 - Just had to add graphics drivers
 - Based on Gentoo; emerge all the way down
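
For the rootless runc setup, the relevant pieces live in the OCI runtime config (config.json): user-namespace ID mappings and a hook that sets up networking. A sketch that emits only that fragment from Go structs, assuming the OCI field names (hooks.prestart, linux.uidMappings with containerID/hostID/size). The netns path is a hypothetical install location, and runc spec --rootless generates a complete starting config.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Hook is one OCI lifecycle hook; runc passes it the container state as JSON on stdin.
type Hook struct {
	Path string   `json:"path"`
	Args []string `json:"args,omitempty"`
}

type Hooks struct {
	Prestart []Hook `json:"prestart,omitempty"`
}

// IDMapping maps a range of IDs inside the container to IDs on the host.
type IDMapping struct {
	ContainerID uint32 `json:"containerID"`
	HostID      uint32 `json:"hostID"`
	Size        uint32 `json:"size"`
}

type Namespace struct {
	Type string `json:"type"`
}

type Linux struct {
	Namespaces  []Namespace `json:"namespaces"`
	UIDMappings []IDMapping `json:"uidMappings"`
	GIDMappings []IDMapping `json:"gidMappings"`
}

// Fragment holds only the parts of config.json discussed here, not a full spec.
type Fragment struct {
	Hooks Hooks `json:"hooks"`
	Linux Linux `json:"linux"`
}

// rootlessFragment maps the calling user to root in a new user namespace
// and runs hookPath before the container process starts.
func rootlessFragment(uid, gid uint32, hookPath string) Fragment {
	return Fragment{
		Hooks: Hooks{Prestart: []Hook{{Path: hookPath}}},
		Linux: Linux{
			Namespaces: []Namespace{
				{Type: "user"}, {Type: "pid"}, {Type: "mount"},
				{Type: "network"}, {Type: "ipc"}, {Type: "uts"},
			},
			UIDMappings: []IDMapping{{ContainerID: 0, HostID: uid, Size: 1}},
			GIDMappings: []IDMapping{{ContainerID: 0, HostID: gid, Size: 1}},
		},
	}
}

func main() {
	// "/usr/local/bin/netns" is a hypothetical install path for the netns hook.
	f := rootlessFragment(uint32(os.Getuid()), uint32(os.Getgid()), "/usr/local/bin/netns")
	out, err := json.MarshalIndent(f, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Newer versions of the OCI runtime spec deprecate prestart in favour of the createRuntime family of hooks, so the hook key may differ depending on the runc version.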
 
 
What if we applied the same defaults to programming languages?
- Generate seccomp filters at build-time
- Previously tried this at run time; it doesn’t work that well because something is always missed
 - At build time we can ensure all code is included in the filter
 - The Go compiler writes the assembly for all the syscalls; you can hijack that step, grab the list and create a seccomp filter from it
 - Not quite that simple:
- plugins
 - exec’ing external programs
 - Go code can make a syscall directly, with the syscall passed in via arguments at runtime
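
The build-time idea can be illustrated at the source level with go/ast, a swapped-in and much simpler technique than hooking the compiler's assembly output as described in the talk. The sketch records syscalls whose number is a syscall.SYS_* constant and counts calls where the number is only known at run time, which is exactly the case that defeats a generated filter. It only sees direct calls to the syscall package in the given source, so syscalls made through os, net or the runtime are missed.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"sort"
	"strings"
)

// rawSyscallFuncs are syscall package functions whose first argument is the syscall number.
var rawSyscallFuncs = map[string]bool{
	"Syscall": true, "Syscall6": true, "RawSyscall": true, "RawSyscall6": true,
}

// scanSyscalls returns the syscall names resolvable at build time and the
// number of calls whose syscall number is only known at run time.
func scanSyscalls(src string) ([]string, int, error) {
	file, err := parser.ParseFile(token.NewFileSet(), "input.go", src, 0)
	if err != nil {
		return nil, 0, err
	}
	seen := map[string]bool{}
	var static []string
	dynamic := 0
	ast.Inspect(file, func(n ast.Node) bool {
		call, ok := n.(*ast.CallExpr)
		if !ok || len(call.Args) == 0 {
			return true
		}
		fn, ok := call.Fun.(*ast.SelectorExpr)
		if !ok || !rawSyscallFuncs[fn.Sel.Name] {
			return true
		}
		// Naive package check: assumes the syscall package is imported without an alias.
		if pkg, ok := fn.X.(*ast.Ident); !ok || pkg.Name != "syscall" {
			return true
		}
		if arg, ok := call.Args[0].(*ast.SelectorExpr); ok && strings.HasPrefix(arg.Sel.Name, "SYS_") {
			// A constant such as syscall.SYS_GETPID can go straight into the filter.
			name := strings.ToLower(strings.TrimPrefix(arg.Sel.Name, "SYS_"))
			if !seen[name] {
				seen[name] = true
				static = append(static, name)
			}
		} else {
			// A variable or parameter: the syscall is chosen at run time.
			dynamic++
		}
		return true
	})
	sort.Strings(static)
	return static, dynamic, nil
}

func main() {
	src := `package p

import "syscall"

func f(n uintptr) {
	syscall.Syscall(syscall.SYS_GETPID, 0, 0, 0)
	syscall.RawSyscall(syscall.SYS_WRITE, 1, 0, 0)
	syscall.Syscall(n, 0, 0, 0)
}
`
	static, dynamic, err := scanSyscalls(src)
	if err != nil {
		panic(err)
	}
	fmt.Println(static, dynamic) // prints "[getpid write] 1"
}
```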
 
 
 - metaparticle.io
- Library for cloud-native applications
 
 
Linux Containers in secure enclaves (SCONE)
- Currently slow
 - Lots of tradeoffs over what executes where (trusted or untrusted area)
 
Soft multi-tenancy
- Reduced threat model, users not actively malicious
 - Hard multi-tenancy would have potentially malicious containers running next to each other
 - Host OS – e.g. CoreOS
 - Container Runtime – Look at glasshouse VMs
 - Network – Lots to do; default deny in k8s is a good start
 - DNS – Needs to be namespaced properly or turned off; one option is kube-dns as a sidecar
 - Authentication and Authorisation – RBAC
 - Isolation of master and system nodes from nodes running containers
 - Restricting access to host resources (k8s hostPath volumes, pod security policy)
 - Making sure everything else is “very dumb” to its surroundings