Containers and micro virtual machines
I wrote an article about my deep dive into containers last month. As part of this learning journey, I built a prototype named Yaman, an extremely limited yet functional container manager.
In today’s article, I introduce a new sub-project named microvm. It’s an experimental container runtime that uses short-lived Virtual Machines (VMs).
This isn’t forward-thinking work: I developed this new prototype for learning purposes. Kata Containers, krunvm and crun (with the help of libkrun and libkrunfw) are production-grade technologies to run containers inside VMs for better isolation.
Exploratory work
Similar to how others have been running Docker containers inside QEMU VMs, I started by tinkering with QEMU’s microvm machine. I compiled a custom Linux kernel for it and wrote my own init1 program. The latter is needed to mimic what a traditional container runtime does when creating a container from a bundle2.
The biggest unknown at the time was the root file system, which resided on the host but had to be used within the VM. Creating a disk image (e.g. qcow2) didn’t sound too great: it’s overly complicated and slow. Instead, I used virtiofsd, a daemon running on the host that can share a folder with a guest (VM) using virtio-fs.
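Roughly, the host side looks like this. This is a sketch with illustrative paths, using the C implementation of virtiofsd (the flags differ slightly for the Rust rewrite); it assumes the guest kernel mounts the shared tag as its root filesystem, and vhost-user devices require a shared memory backend:

# Share the bundle's rootfs with the guest over a vhost-user socket (illustrative paths).
/usr/libexec/virtiofsd --socket-path=/tmp/vfsd.sock -o source=/tmp/alpine-bundle/rootfs &

# Boot a microvm guest and expose the shared folder under the "rootfs" tag.
qemu-system-x86_64 -M microvm -nographic \
  -object memory-backend-memfd,id=mem,size=512M,share=on -numa node,memdev=mem \
  -chardev socket,id=vfsd,path=/tmp/vfsd.sock \
  -device vhost-user-fs-device,chardev=vfsd,tag=rootfs \
  -kernel bzImage -append 'console=ttyS0 rootfstype=virtiofs root=rootfs rw'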
For the Linux kernel, I created a minimal configuration with make allnoconfig and make menuconfig to enable some options. I spent a lot of time on this part because I wanted to understand what I was configuring. I recompiled the kernel about 200 times according to the kernel’s make output 😅
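The options I toggled in menuconfig could also be set non-interactively with the kernel tree’s scripts/config helper; here is a sketch of that approach (the option list is an illustrative subset, not my full configuration):

make allnoconfig
# Enable a handful of options on top of the empty configuration (illustrative subset).
./scripts/config --enable CONFIG_64BIT \
                 --enable CONFIG_TTY \
                 --enable CONFIG_SERIAL_8250 \
                 --enable CONFIG_SERIAL_8250_CONSOLE \
                 --enable CONFIG_VIRTIO_MMIO \
                 --enable CONFIG_FUSE_FS \
                 --enable CONFIG_VIRTIO_FS
make olddefconfig   # resolve the dependencies of the newly enabled options
make -j"$(nproc)"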
Most kernel messages could be hidden with the kernel’s quiet option, except when the machine rebooted. I really wanted to hide all messages unrelated to the (container) process running in the VM, so I did what most people would have done in this situation: I patched the kernel.
Last but not least, I wrote a simple init program to mount some directories (similar to what I did in Yacr) and read some environment variables in order to (1) set the hostname and (2) execute the container process. About the latter: the kernel doesn’t expect the special init process to terminate, which necessarily happens when the container process has finished its execution. I patched the kernel to reboot instead 🙈
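In shell-script form, the init program does roughly the following (a sketch: the real program is a small static binary, and the HOSTNAME and CMD variable names are only illustrative):

#!/bin/sh
# Roughly what the init program does once the kernel hands over control.
mount -t proc  proc  /proc
mount -t sysfs sysfs /sys
mount -t tmpfs tmpfs /tmp

hostname "$HOSTNAME"   # (1) set the hostname from an environment variable

exec $CMD              # (2) execute the container process; when it exits, the
                       #     (patched) kernel reboots instead of panicking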
The init program reads environment variables because the Linux kernel passes unknown kernel parameters to it3. This is super convenient because QEMU lets us configure the kernel’s command line using the -append option. This is how the container process (defined in the container configuration) is passed to the VM, for instance.
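For instance, here is a sketch of the command line the runtime could pass to QEMU (reusing the illustrative HOSTNAME and CMD names from the sketch above):

# Parameters with an '=' and no '.' end up as environment variables for init.
qemu-system-x86_64 -M microvm -nographic \
  -kernel bzImage \
  -append 'console=ttyS0 quiet HOSTNAME=alpine-ctr CMD=/bin/sh'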
At this point, I could start a VM using my own kernel and get an interactive shell 🎉
(host) $ make run BUNDLE=/tmp/alpine-bundle CID=alpine-ctr
/ # uname -a
Linux alpine-ctr 5.15.47-willdurand-microvm #1 SMP Fri Jul 9 19:08:45 UTC 2022 x86_64 Linux
/ # hostname
alpine-ctr
/ # ls
bin etc lib mnt proc run srv tmp var
dev home media opt root sbin sys usr
/ #
The custom kernel, the init program and the QEMU/virtiofsd configs seemed to work fine. Phew! This was just QEMU running in my terminal, though. Obviously, I couldn’t stop there…
The MicroVM runtime
I moved the few commands I had in a Makefile to a new Go CLI application named microvm, which partially implements the OCI runtime specification. This highly experimental container runtime is fully integrated with my own container manager.
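Like any OCI runtime, microvm is driven by the container manager (through a shim) using the standard lifecycle operations. A simplified sketch of those calls, assuming a runc-like command line (only part of the specification is implemented):

# Standard OCI runtime lifecycle, as invoked by the shim (simplified).
microvm create --bundle /tmp/alpine-bundle alpine-ctr   # prepare the VM, virtiofsd, pipes
microvm start alpine-ctr                                # run the container process in the VM
microvm state alpine-ctr                                # report the container state as JSON
microvm delete alpine-ctr                               # tear the VM down and clean up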
Let’s consider the following example, which prints a short weather report using two containers:
$ echo 'wttr.in' \
| sudo yaman c run --rm --interactive docker.io/library/bash -- xargs wget -qO /dev/stdout \
| sudo yaman c run --interactive --runtime microvm quay.io/aptible/alpine -- head -n 7
Weather report: Brussels, Belgium
\ / Partly cloudy
_ /"".-. 17 °C
\_( ). ↘ 24 km/h
/(___(__) 10 km
0.0 mm
From a user perspective, the fact that the second container has been created with a virtual machine is an implementation detail, and that’s what I wanted to achieve with this work.
Under the hood, the following steps are performed:
- A first interactive container is created with the default runtime (Yacr). This container uses the official Bash Docker image. It reads the value wttr.in from stdin (that’s why it has to be interactive) and executes wget -qO /dev/stdout wttr.in (which behaves like curl wttr.in, but curl wasn’t installed in the Docker image).
- The output of the first container is redirected to the second container. The first container exits and gets automatically removed because --rm was specified.
- The second container is also interactive but it uses the microvm runtime and an Alpine image from Quay.io. When the microvm runtime is invoked to create the container, it creates a virtual machine using QEMU and spawns a virtiofsd process to share the root filesystem with this VM.
- In the VM, the init process executes head -n 7, which returns the first 7 lines of data received on stdin. The result is printed on the host’s standard output (stdout) and the second container exits.
The --rm option was not specified when we ran the second container, meaning we can still retrieve it by listing all the containers. With the container ID, we can fetch the logs and inspect the container (until we delete it):
$ sudo yaman c ls -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS
26127b728da94fd7a184549f2c0f586c quay.io/aptible/alpine:latest head -n 7 15 seconds ago Exited (0) 12 seconds ago
$ sudo yaman c logs 26127b728da94fd7a184549f2c0f586c
Weather report: Brussels, Belgium
\ / Partly cloudy
_ /"".-. +22(24) °C
\_( ). ↓ 7 km/h
/(___(__) 10 km
0.0 mm
$ sudo yaman c inspect 26127b728da94fd7a184549f2c0f586c | jq '.Shim.Runtime'
"microvm"
Of course, the example above was using Yaman, but we could also reproduce the same example4 with our good ol’ Docker friend 😉
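For Docker to accept --runtime=microvm, the runtime first has to be declared in the daemon configuration. A minimal sketch, assuming the binary is installed as /usr/local/bin/microvm:

$ cat /etc/docker/daemon.json
{
  "runtimes": {
    "microvm": {
      "path": "/usr/local/bin/microvm"
    }
  }
}

With that in place (and the Docker daemon restarted), the pipeline looks almost the same: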
$ echo 'wttr.in' \
| docker run --interactive bash xargs wget -qO /dev/stdout \
| docker run --runtime=microvm --interactive alpine head -n 7
Unable to find image 'bash:latest' locally
Unable to find image 'alpine:latest' locally
Digest: sha256:686d8c9dfa6f3ccfc8230bc3178d23f84eeaf7e457f36f271ab1acc53015037c
Status: Downloaded newer image for alpine:latest
Digest: sha256:b3abe4255706618c550e8db5ec0875328333a14dbf663e6f1e2b6875f45521e5
Status: Downloaded newer image for bash:latest
Weather report: Freiburg im Breisgau, Germany
\ / Sunny
.-. 15 °C
― ( ) ― ↙ 4 km/h
`-’ 10 km
/ \ 0.0 mm
$ docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5976d2c7fe86 alpine "head -n 7" About a minute ago Exited (0) About a minute ago jolly_shannon
$ docker inspect 5976d2c7fe86 | jq '.[0].HostConfig.Runtime'
"microvm"
Now, what if we want to spawn an interactive shell? Well, that’s possible too!
Unfortunately, this terminal mode does not fully work with Docker (yet) 😞
It turns out that handling I/O isn’t trivial, especially when a virtual machine is involved. The microvm runtime uses some tricks to redirect I/O correctly (e.g. it spawns a special process and uses named pipes on the QEMU side), but the terminal mode is handled differently5.
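To give an idea of the QEMU side of that trick, here is a sketch using QEMU’s pipe character device with illustrative paths (the real runtime wires a few more things together):

# QEMU's "pipe" chardev reads from <path>.in and writes to <path>.out.
mkfifo /tmp/ctr-serial.in /tmp/ctr-serial.out

qemu-system-x86_64 -M microvm -display none \
  -kernel bzImage -append 'console=ttyS0 quiet' \
  -chardev pipe,id=serial0,path=/tmp/ctr-serial \
  -serial chardev:serial0

# A host-side helper can then feed the container's stdin to /tmp/ctr-serial.in
# and forward /tmp/ctr-serial.out to stdout and the logs.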
In terms of limitations, the virtual machines created by the microvm runtime don’t have any network access at the moment. In fact, the kernel is built without its network stack. This is probably something I’ll add in the future.
As for the next steps, I’d like to play with KVM, which none of my “Linux environments” give me access to at the moment. I suppose that would be better performance-wise, although I haven’t considered performance at all (good thing it’s an educational project, hehe).
Overall, I am pretty happy with the results and I learned a few things. It was especially fun to play with the Linux kernel again!
1. init(1) is the very first userspace program executed by the Linux kernel when it has finished its initialization. ↩
2. A container bundle (or “OCI bundle”) is a folder that contains all the information needed to run containers. It is usually derived from an “image” (like a “Docker image”). A bundle must contain a config.json file and the container’s root filesystem (which is a folder usually named rootfs). ↩
3. Only unknown kernel parameters that do not contain . are passed to init. Those with = are passed as environment variables, the others are passed as arguments (argv). ↩
4. It is possible to fully reproduce the example with these scripts. ↩
5. The terminal mode works fine with Yaman and I only noticed the problem with Docker when I wrote this blog post, sigh. ↩
Feel free to fork and edit this post if you find a typo, thank you so much! This post is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.