Deep dive into containers
Freiburg, Germany

It (almost) all started with this talk from Liz Rice that I found in my Pocket list. I spent some time on a Sunday afternoon writing the same code and decided to study the topic more in depth. I wanted to better understand what was behind containers and how the different technologies interact with each other.
That was a month or so ago and things got out of control pretty quickly 😅 Given there are many talks and articles about containers already, this article is more of a “Show and Tell” in which I describe what I’ve been poking at.
Yacr: Yet another container runtime
At this level, there is no concept of images, registries, volumes, etc. An Open Container Initiative (OCI)-compliant runtime takes an identifier and a “bundle” as input. A bundle is a folder that contains a `config.json` file and a root filesystem (“rootfs”).
I started by updating the code I had initially written to build a minimal (and insecure[^1]) runtime named Yacr. I learned a lot of low-level information and I was super happy when Docker was able to use my very own runtime. Yacr works with containerd, too!
As for the implementation, I followed the runtime spec, which is a rather high-level description of a runtime and not really a specification per se. In fact, the runtime I wrote had to be “runc-compliant” in order to be used by other tools (runc is the reference implementation of the runtime spec).
I then looked into container shims.
Yacs: Yet another container shim
When a runtime starts a container, it often uses `exec(3)` to replace itself with the actual container process. This is a problem because (1) standard input/output are no longer (easily) available, and (2) it is difficult to know when the process exits and why.

While it might be possible to poll `/proc` to solve (2), that wouldn’t solve (1). We could make the runtime a long-running process, but that does not seem ideal either. The concept of container shims has been introduced to solve these two problems (and more) in an elegant manner.
Shims sit between a container manager and a container runtime. In principle, a shim invokes an OCI runtime to create/start containers. In addition, shims solve the two problems stated above by:
- becoming a subreaper, which allows them to reap (adopt) any child processes created by their own child processes (e.g. the runtime). This, in turn, allows shims to be notified when child processes exit;
- keeping the container’s input/output open. For instance, my implementation creates FIFOs (named pipes) so that it is possible to interact with the container process at any time.
The prototype I implemented is named Yacs and it uses the Yacr runtime by default. Yacs should work with any OCI runtime but I only tried it with `yacr` and `runc`.
Example
Yacs provides an HTTP API via a unix socket because that was easy to implement. In the example below, we create a new container with `yacs`, which will invoke the runtime (`yacr`) to create the container.
As per the runtime spec, the container is only created, not started yet, which is why we see the `yacr create container` process under the `yacs` process in the `ps` output. The `yacr` process waits for the “start” command.
```
$ yacs --bundle=/tmp/alpine-bundle --container-id=alpine
/home/gitpod/.run/yacs/alpine/shim.sock
$ ps auxf
USER     PID   COMMAND
[...]
gitpod   44458 yacs --bundle=/tmp/alpine-bundle --container-id=alpine
gitpod   44488  \_ yacr --log-format json --log /home/gitpod/.run/yacs/alpine/yacr.log create container alpine --root /home/gitpod/.run/yacr --bundle /tmp/alpine-bundle
```
When we ask the shim to start the container using the HTTP API, it invokes the runtime again to start the container. At this point, the container process (`sh /hello-loop.sh` in this example) should be running under the `yacs` (subreaper) process (see the `ps` output below).
```
$ curl -X POST -d 'cmd=start' --unix-socket /home/gitpod/.run/yacs/alpine/shim.sock http://shim/
{
  "id": "alpine",
  "runtime": "yacr",
  "state": {
    "ociVersion": "1.0.2",
    "id": "alpine",
    "status": "running",
    "pid": 44488,
    "bundle": "/tmp/alpine-bundle"
  },
  "status": {}
}
$ ps auxf
USER     PID   COMMAND
[...]
gitpod   44458 yacs --bundle=/tmp/alpine-bundle --container-id=alpine
gitpod   44488  \_ sh /hello-loop.sh
gitpod   55758      \_ sleep 1
```
Yacs saves the output of the container process in a JSON file (called a “log file”) so that the output can still be read after the container process has died (useful for container managers). We can use the HTTP API to fetch these logs:
```
$ curl --unix-socket /home/gitpod/.run/yacs/alpine/shim.sock http://shim/logs
[...]
{"m":"Hello!","s":"stdout","t":"2022-06-12T11:51:44.947554491Z"}
{"m":"Hello!","s":"stdout","t":"2022-06-12T11:51:45.948493454Z"}
```
Each line in a log file is a JSON object with the output message (`m`), the stream (`s`) and the timestamp (`t`). Container managers can then implement a `logs` command that reads this log file and prints each message to the right stream (`stdout` or `stderr`).
Yacs also supports console sockets, in which case logs are not available. The HTTP API supports other commands to send signals to a container, delete it, and even terminate the shim itself.
Yaman: Yet another (container) manager
With a somewhat functional runtime and a shim that could do a few things correctly, I decided to look into container managers (like Docker and Podman).
My initial idea was to write the minimum amount of code to use an existing Docker image with the two tools I had written and without Docker. Ha, ha.
I ended up writing a daemon-less container manager that creates and manages rootless containers. These containers can even reach the Internet (which isn’t a given)! I tried to make container I/O work correctly as well, with support for interactive mode and a pseudo-terminal.
It was actually “simpler” to implement a daemon-less manager than to implement a daemon with an API and a client CLI to talk to it. I also find this approach more elegant in general, but that’s my personal opinion.
Examples
The first example pipes `wttr.in` to a first container that reads from its standard input (`stdin`) in order to call `wget`, which prints its output to `stdout` (a workaround for not having `curl` in the container). This first container should be automatically removed when it exits because the `--rm` option has been specified.
The output of the first container is piped into a second container (running an “alpine” image from a different registry), which will only take the first 7 lines of the input it receives.
We then get the final output of these commands into our terminal 🎉
```
$ echo 'wttr.in' \
  | yaman c run --rm --interactive docker.io/library/alpine -- xargs wget -qO /dev/stdout \
  | yaman c run --interactive quay.io/aptible/alpine -- head -n 7
Weather report: Freiburg im Breisgau, Germany

     _`/"".-.     Thundery outbreaks possible
      ,\_(   ).   17 °C
       /(___(__)  ↘ 4 km/h
         ⚡‘‘⚡‘‘  9 km
         ‘ ‘ ‘ ‘  0.0 mm

# We list all the containers and observe that the first container has been
# removed automatically. The second one is still listed.
$ yaman c ls -a
CONTAINER ID                      IMAGE                           COMMAND    CREATED         STATUS                     NAME
10bddbfd480c46ffbbc8a5005134e1d7  quay.io/aptible/alpine:latest   head -n 7  35 seconds ago  Exited (0) 35 seconds ago  bold_zhukovsky

# Even if the second container has exited, we can still fetch its logs. We ask
# Yaman to print the logs with the timestamps.
$ yaman c logs --timestamps 10bddbfd480c46ffbbc8a5005134e1d7
2022-06-21T07:04:06Z - Weather report: Freiburg im Breisgau, Germany
2022-06-21T07:04:06Z -
2022-06-21T07:04:06Z -      _`/"".-.     Thundery outbreaks possible
2022-06-21T07:04:06Z -       ,\_(   ).   17 °C
2022-06-21T07:04:06Z -        /(___(__)  ↘ 4 km/h
2022-06-21T07:04:06Z -          ⚡‘‘⚡‘‘  9 km
2022-06-21T07:04:06Z -          ‘ ‘ ‘ ‘  0.0 mm
```
The second example below shows that we can re-attach to a container started in detached mode (with a terminal). This new container was created by runc and we specified a custom `hostname`:
```
$ yaman c run --rm --runtime runc --hostname ubuntu-demo -it -d docker.io/library/ubuntu
47988b8c7bde4f3d8e84568faea3e3f4
$ yaman c attach 47988b8c7bde4f3d8e84568faea3e3f4
root@ubuntu-demo:/# tty
/dev/pts/0
root@ubuntu-demo:/# ls
bin   dev  home  lib32  libx32  mnt  proc  run  srv  tmp  var
boot  etc  lib   lib64  media   opt  root  sbin  sys  usr
root@ubuntu-demo:/# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"
```
Internals
Under the hood, this container manager, named Yaman, relies on fuse-overlayfs in rootless mode, native OverlayFS in rootfull mode (e.g. when Yaman is executed with `sudo`), and slirp4netns for the network layer.
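For the rootfull case, stacking image layers boils down to one `mount(2)` call with the right option string. A hedged Go sketch (illustrative paths, not Yaman’s actual code):

```go
package main

import (
	"fmt"
	"strings"
	"syscall"
)

// overlayOptions builds the option string for an overlay mount: the
// read-only image layers are stacked as lowerdir, upperdir receives the
// container's writes, and workdir is scratch space the kernel requires
// on the same filesystem as upperdir.
func overlayOptions(lowers []string, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s",
		strings.Join(lowers, ":"), upper, work)
}

// mountOverlay performs the actual mount(2); this requires privileges,
// which is why a rootless manager has to fall back to fuse-overlayfs.
func mountOverlay(target string, lowers []string, upper, work string) error {
	return syscall.Mount("overlay", target, "overlay", 0,
		overlayOptions(lowers, upper, work))
}

func main() {
	// Only print the option string here; calling mountOverlay needs root.
	fmt.Println(overlayOptions([]string{"/layers/a", "/layers/b"}, "/upper", "/work"))
}
```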
That has been a lot of work[^2] and I took many shortcuts and introduced a few hacks to end up with a functional tool. For example, image and layer management isn’t great, to say the least. Yaman is full of limitations and probably bugs as well. Some of the known limitations have been documented. The others are yet to be found 🙈
Thanks to the power of abstractions and specifications, it is possible to use any OCI-compliant runtime with Yaman. That should make the whole thing a bit more reliable and secure[^3] 😬
Conclusion
I learned so much recently! If you want to know more about this work, you can find Yacr, Yacs and Yaman on GitHub. Feel free to try them out on Gitpod or locally using Vagrant.
I have been using Docker for many years without necessarily questioning why things were what they were or how this whole thing was actually working under the hood.
Now when I have a more specific question about Docker (or containerd or Podman), I can follow the source code and usually think “oh, that makes sense” or “ha, yeah, clever!”. This happened a few times lately and that put a smile on my face every time.
And it’s enough to consider this deep dive a good investment of my free time! ❤️
[^1]: For instance, `cgroups`, capabilities and seccomp are not supported.
[^2]: Both the distribution spec and image spec have been helpful while implementing Yaman.
[^3]: No.
Feel free to fork and edit this post if you find a typo, thank you so much! This post is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.