jLuger.de - Podman

Podman is a container engine where a normal user (id > 0) can start containers and there is no daemon running as root to do the actual work. Some may say "Oh wow, another tool to manage Linux namespaces", but while working on hustior I found some issues that should prevent a tool like podman from existing. So when they released version 1.0 this year I was very excited to see how they solved these issues, but I couldn't find the time to do the research. Well, until recently.

To start with I should answer why you would want a rootless container and why not Docker:
Every user that can start Docker containers is effectively root on the machine.

You don't want this for at least two reasons:
1. Because of a misconfiguration or security vulnerability unwanted users are allowed to start Docker containers. I've done some research about Docker vulnerabilities for a talk and while there aren't many, you don't want to be a victim of the one that pops up every few years. The topic of misconfiguration is even more important. This year a Docker playground got into the news for having been hacked, while in reality it was a misconfiguration. When you read https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities you will notice that the process isolation for a privileged container is more a kind of request and not enforced like with a normal container. So you want to be very sure that the software in the container behaves nicely. The Docker playground did the exact opposite: they allowed people on the internet to run any program in it. So even people who want others to learn about Docker make mistakes. You shouldn't assume that you will do better.
2. There are regulated environments where developers should create containers (including the testing of their creation) but are under no circumstances allowed to get root. Such environments are banking, insurance, large government agencies and probably a lot more.

Why root for Docker
As I've mentioned at the beginning I've found two issues that require root. The first is with the mount namespace and the second with the network namespace.

Mount namespace
The reason for root has a name: FS_USERNS_MOUNT
This is the name of a flag that a file system must set, or the caller needs to be root on the host computer to mount it. The reason for this is the fear that an attacker will create a malicious file system and attack the kernel with it. Some argue that this is already possible with flash drives, but with them you can only attack the PC in front of you. The fear with Linux namespace based attacks is that 10% of Amazon's servers get taken over and then attack the other 90%.
Some filesystems that export the flag: proc, sysfs, devpts, ramfs, cgroupfs, FUSE
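For illustration, here is a minimal sketch of the effect (assuming the unshare tool from util-linux; the exact error text may vary with kernel and distribution):
$ unshare --user --map-root-user --mount bash   # new user + mount namespace, mapped to "root"
# mount -t ramfs ramfs /mnt                     # works: ramfs sets FS_USERNS_MOUNT
# mount -t ext4 /dev/sda1 /mnt                  # fails with "permission denied": ext4 doesn't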

Network namespace
A new network namespace is like a computer without connectivity (no network card, no WiFi, ...). You may have a use for that but definitely not as a server.
Normally you would use a virtual ethernet card (veth) to connect the namespace. However those veth devices are always created as a pair, connected by a virtual ethernet cable that can't be unplugged. As a pair of network cards inside the namespace wouldn't solve the problem, you need a way to move one card to another network namespace. And there is a way, but it requires root on the host computer.
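For illustration, a sketch of the classic veth setup that rootless podman cannot use, because both creating the pair and moving one end require root (CAP_NET_ADMIN) in the host network namespace (<container-pid> is a placeholder for the pid of the container's main process):
# ip link add veth-host type veth peer name veth-cont   # create the connected pair
# ip link set veth-cont netns <container-pid>           # move one end into the container
# ip addr add 10.0.3.1/24 dev veth-host                 # configure the host end
# ip link set veth-host up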

Installation
The best way to get to know podman is to install it.
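The command line isn't shown in the output below; on Fedora 29 it was presumably something like:
$ sudo dnf install podman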
Fedora 29 - x86_64                               19 kB/s |  26 kB     00:01    
Dependencies resolved.
================================================================================
 Package                    Arch   Version                        Repo     Size
================================================================================
Installing:
 podman                     x86_64 1:1.2.0-2.git3bd528e.fc29      updates  10 M
Installing dependencies:
 containernetworking-plugins
                            x86_64 0.7.4-2.fc29                   updates  13 M
 containers-common          x86_64 1:0.1.35-2.git404c5bd.fc29     updates  31 k
 fuse3-libs                 x86_64 3.4.2-2.fc29                   updates  82 k
 ostree-libs                x86_64 2019.1-3.fc29                  updates 363 k
 runc                       x86_64 2:1.0.0-85.dev.gitdd22a84.fc29 updates 2.3 M
 libnet                     x86_64 1.1.6-16.fc29                  fedora   62 k
 protobuf-c                 x86_64 1.3.0-5.fc29                   fedora   33 k
Installing weak dependencies:
 container-selinux          noarch 2:2.95-1.gite3ebc68.fc29       updates  46 k
 criu                       x86_64 3.11-1.fc29                    updates 487 k
 fuse-overlayfs             x86_64 0.3-8.dev.gita6958ce.fc29      updates  49 k
 slirp4netns                x86_64 0.3-0.alpha.2.git30883b5.fc29  updates  71 k

Transaction Summary
================================================================================
Install  12 Packages

If you look closely you will see the two packages containing the name fuse. FUSE, like the filesystem that exports the required FS_USERNS_MOUNT flag. Oh, and overlayfs sounds like the overlay mechanism needed for docker container layers.
Another package contains the short name for network namespace (netns).

FUSE
FUSE is short for Filesystem in Userspace. Traditionally a filesystem is part of the kernel as a filesystem translates between the files and folders you see and the byte pattern used to store them on a hard drive. This hardware management is what the kernel does. And in order to do the hardware management the kernel needs full access to all hardware. That means that everything that runs in kernel space has full access to everything on your computer.
But sometime during the evolution of computing people wanted to use more than a hard drive to store data. One example I remember was abusing Gmail to store data (yes, in Gmail, not Drive). To keep the kernel small and secure the FUSE interface was created so that programs in userspace can provide the data. Unfortunately this solution has a cost. Not in money but in runtime performance: each switch between user- and kernelspace costs some performance because passing security gates is never free.
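As a small illustration (a sketch assuming the sshfs package is installed; any FUSE filesystem behaves the same way), an unprivileged user can mount and unmount a filesystem whose data is provided by a userspace program:
$ sshfs user@remote:/data /tmp/data    # mount without root, data served by the sshfs program
$ mount | grep fuse                    # shows /tmp/data with filesystem type fuse.sshfs
$ fusermount -u /tmp/data              # unmount, again without root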

Running podman
Before investigating the network let's start podman.

First as root
root      2908  0.1  3.0 922732 62576 pts/0    Sl+  02:39   0:00 podman run -ti ubuntu:19.04 bash
root      2963  0.0  0.0  77848  1980 ?        Ssl  02:39   0:00 /usr/libexec/podman/conmon -s -c 
8f470462811ad66e16aa7d2daf2733487f971d33320788dc2669ffa1a5cdd59c -u 8
root      2975  0.0  0.1   4176  3368 pts/0    Ss+  02:39   0:00 bash
      
You see the podman command as there is no daemon to do the job. Then there is the application in the container: bash. Additionally there is a container monitor called conmon. If you ever have to name software, please do me a favor and don't use something that looks like a misspelling of a common English word. I had a hard time googling for it.

Now as a normal user
vagrant   3413  0.0  2.9 775184 59276 pts/0    Sl+  02:44   0:00 podman run -ti ubuntu:19.04 bash
vagrant   3445  2.8  4.2 923160 85684 pts/0    Sl+  02:44   0:05 podman run -ti ubuntu:19.04 bash
vagrant   3487  1.3  0.1   5324  3196 ?        Ss   02:44   0:02 /usr/bin/fuse-overlayfs -o 
lowerdir=/home/vagrant/.local/share/containers/storage/overlay/l/XZ3SMZKBOP
vagrant   3490  0.0  0.0  77848  1944 ?        Ssl  02:44   0:00 /usr/libexec/podman/conmon -c 
f99c3b2ebde7c7e503193c82adff4a5b893f80f98607496f303cb7c71e5f8e79 -u f99c
vagrant   3500  0.0  0.0   4164   756 pts/0    Ss   02:44   0:00 bash
vagrant   3506  0.2  0.0   3276  1928 pts/0    S    02:44   0:00 /usr/bin/slirp4netns --disable-host-loopback 
--mtu 65520 -c -e 3 -r 4 3500 tap0      
Although you see two podman instances I've started it only once. The second one was started automatically. You now also see the fuse-overlayfs program providing the filesystem data and the slirp4netns program mentioned earlier. If you look closely at the program arguments of slirp4netns you will see tap0 at the end and before that the id 3500. This is the pid of the bash process that is the main program in the container.

Two instances of podman as normal user
vagrant   4876  0.0  0.2 224992  4352 pts/2    Ss   08:23   0:00 /bin/bash
vagrant   4898  0.1  2.8 775100 57432 pts/2    Sl+  08:23   0:00 podman start -i -a f99c3b2ebde7
vagrant   4904  0.4  3.0 849000 61296 pts/2    Sl+  08:23   0:00 podman start -i -a f99c3b2ebde7
vagrant   4918  0.0  0.1   4128  2696 ?        Ss   08:23   0:00 /usr/bin/fuse-overlayfs -o lowerdir=/home/[...]
vagrant   4921  0.0  0.0  77848  1940 ?        Ssl  08:23   0:00 /usr/libexec/podman/conmon -c f99c3b2ebde7c7e50[...]
vagrant   4933  0.1  0.1   4052  2712 pts/0    Ss+  08:23   0:00 bash
vagrant   4939  0.0  0.0   2572   756 pts/2    S    08:23   0:00 /usr/bin/slirp4netns --disable-host-loopback 
--mtu 65520 -c -e 3 -r 4 4933 tap0
vagrant   4949  0.0  0.2 224992  4400 pts/3    Ss   08:23   0:00 /bin/bash
vagrant   4986  0.5  2.8 775184 58788 pts/3    Sl+  08:24   0:00 podman run -ti ubuntu:19.04 bash
vagrant   4992  0.9  3.0 775268 61952 pts/3    Sl+  08:24   0:00 podman run -ti ubuntu:19.04 bash
vagrant   5006  0.0  0.1   4436  2952 ?        Ss   08:24   0:00 /usr/bin/fuse-overlayfs -o lowerdir=/home/[...]
vagrant   5009  0.0  0.0  77848  1896 ?        Ssl  08:24   0:00 /usr/libexec/podman/conmon -c cc01d0d9983e[...]
vagrant   5020  0.2  0.1   4052  2532 pts/0    Ss+  08:24   0:00 bash
vagrant   5026  0.0  0.0   2572   816 pts/3    S    08:24   0:00 /usr/bin/slirp4netns --disable-host-loopback 
--mtu 65520 -c -e 3 -r 4 5020 tap0
The important part here is that everything is doubled as there is no central daemon that could be reused by the second container. This also means that the two instances are not connected through a single shared program, which means better isolation.

The network
Now that podman is running let's take a look at the network.
root@f99c3b2ebde7:/# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: tap0: <BROADCAST,UP,LOWER_UP> mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
    link/ether 6a:26:66:d1:b8:1c brd ff:ff:ff:ff:ff:ff
    inet 10.0.2.100/24 brd 10.0.2.255 scope global tap0
       valid_lft forever preferred_lft forever
    inet6 fe80::6826:66ff:fed1:b81c/64 scope link 
       valid_lft forever preferred_lft forever     
You see the tap0 from slirp4netns again. So what is this tap? Well https://en.wikipedia.org/wiki/TUN/TAP has a nice definition:
TAP (namely network tap) simulates a link layer device and it operates with layer 2 packets like Ethernet frames.
Packets sent by an operating system via a TUN/TAP device are delivered to a user-space program [...]

So instead of two virtual network cards like with veth you have only one virtual network card and the other end is a user space program. In order to understand the user space program I've searched around and found the relevant code in https://github.com/rootless-containers/slirp4netns/blob/master/main.c. Here are the relevant code parts needed to understand it:
child:
/* join the network namespace of the container process */
if ((rc = nsenter(target_pid)) < 0) {
/* open the tap device inside that namespace */
if ((tapfd = open_tap(tapname)) < 0) {
/* send the open file descriptor back to the parent over a unix socket */
if (sendfd(sock, tapfd) < 0) {

parent:
/* wait for the child to finish its work inside the namespace */
waitpid(child_pid, &child_wstatus, 0);
/* process all traffic on the received tap fd via the slirp library */
if ((rc = do_slirp(tapfd, exit_fd, api_socket, cfg)) < 0) {
As you can see slirp4netns forks itself and then the child process enters the container's namespace. Hence it requires the pid of the container process. There it opens the tap device and then sends the opened file descriptor back (yes, under Linux even an opened network device is referenced by a file descriptor, as everything is a file under Unix, and Linux has inherited this from Unix). The parent waits until it gets the file descriptor for the tap device. The handling of the network traffic is done by the slirp library.
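You can imitate what the child does with the nsenter tool (a sketch; 3500 is the pid of the container's bash from the earlier listing, and --preserve-credentials keeps your own uid while joining the user namespace):
$ nsenter -t 3500 -U -n --preserve-credentials ip a    # shows lo and tap0 from inside the container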

Please note that when you start podman as a normal user you can't bind ports below 1024. This should be nothing new as a normal user can never use ports below 1024, but I just wanted to highlight it because the error message you get when you try looks like this:
[vagrant@fedora29 ~]$ podman run -p 443:8080 -ti ubuntu:19.04 bash
Error: error from slirp4netns while setting up port redirection:
map[desc:bad request: add_hostfwd: slirp_add_hostfwd failed]
Not quite helpful for finding the real reason. Oh, and for all those wondering whether 1024 itself is allowed, here is the answer:
[vagrant@fedora29 ~]$ podman run -p 1024:8080 -ti ubuntu:19.04 bash
root@f42ee6a2e7e0:/# exit     
Yes it is.
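The cut-off is enforced by the kernel, not by podman. You can check it yourself (a sketch; this sysctl exists since kernel 4.11, and ports below the value are privileged):
$ sysctl net.ipv4.ip_unprivileged_port_start
net.ipv4.ip_unprivileged_port_start = 1024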

Normal user in a container
As I was playing around I've created a container where the main program is run by a normal user. Then I saw these file permissions:
mustermann@e476b77c25cc:~$ id
uid=1000(mustermann) gid=1000(mustermann) groups=1000(mustermann)
mustermann@e476b77c25cc:~$ ls -la
total 96
drwxr-xr-x. 2 mustermann mustermann    59 Apr 22 10:52 .
drwxr-xr-x. 3 root       root          24 Apr 20 20:17 ..
-rw-------. 1 mustermann mustermann  4899 Apr 22 10:52 .bash_history
-rw-r--r--. 1 mustermann mustermann   220 Jan 24 10:22 .bash_logout
mustermann@e476b77c25cc:~$ ls -l /
drwxr-xr-x.   2 root   root       6 Mar 10 05:23 opt
dr-xr-xr-x. 128 nobody nogroup    0 Apr 22 10:53 proc
drwx------.   2 root   root      37 Mar 10 05:24 root
You see that there are files belonging to the normal user (mustermann) and to root. That means you have at least two users in the container. That shouldn't be possible, as an unprivileged user can only map his own id into a child user namespace. Only for dynamically generated files like the ones under /proc, whose owner has no mapping in the container, is the owner shown as the overflow user nobody (as expected).
The process list turned out to be very helpful:
vagrant  15914  0.0  2.9 775184 59244 pts/4    Sl+  13:17   0:00 podman run -ti 0e6953c6397f bash
vagrant  15920  0.0  3.0 849000 61984 pts/4    Sl+  13:17   0:00 podman run -ti 0e6953c6397f bash
vagrant  15935  0.0  0.0   4116  1572 ?        Ss   13:17   0:00 /usr/bin/fuse-overlayfs -o lowerdir=/home/[...]
vagrant  15938  0.0  0.0  77848  2024 ?        Ssl  13:17   0:00 /usr/libexec/podman/conmon -c e476b77c25cc94a22[...]
100999   15949  0.0  0.1   4180  2320 pts/0    Ss+  13:17   0:00 bash
vagrant  15955  0.0  0.0   2704  1920 pts/4    S    13:17   0:00 /usr/bin/slirp4netns --disable-host-loopback 
--mtu 65520 -c -e 3 -r 4 15949 tap0
You can see that the container process suddenly runs under another uid (100999). The reason for this has a name:
/etc/subuid
and a content
vagrant:100000:65536
With this file you can have subordinate user ids. According to the description root should be 100,000 and mustermann should be 101,000, but it seems that root is mapped to vagrant and everything else is shifted by minus one. While this is pretty cool and explains why you can have multiple users inside a podman container, it also requires some root permission on the host, because without root you won't get an entry into /etc/subuid.
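You can inspect the mapping that podman set up via the container's uid_map (a sketch; 15949 is the pid of the container's bash from the listing above):
$ cat /proc/15949/uid_map
         0       1000          1
         1     100000      65536
The first line maps container uid 0 (root) to vagrant's uid 1000, the second maps container uids 1 to 65536 onto the subordinate range starting at 100000. That is why mustermann (uid 1000 in the container) shows up as 100999 on the host.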

lsns
During my research I've found a tool called lsns which is short for list namespaces. Calling it is simple but to get some insight you first need the process ids of the container programs.
vagrant   4986  0.0  2.8 922648 57368 pts/3    Sl+  05:19   0:03 podman run -ti ubuntu:19.04 bash
vagrant   4992  0.0  2.9 922732 60128 pts/3    Sl+  05:19   0:03 podman run -ti ubuntu:19.04 bash
vagrant   5020  0.0  0.0   4164  1592 pts/0    Ss+  05:19   0:00 bash
Call to lsns:
[vagrant@fedora29 ~]$ lsns
        NS TYPE   NPROCS   PID USER    COMMAND
4026532243 user        5  4992 vagrant podman run -ti ubuntu:19.04 bash
4026532244 mnt         4  4992 vagrant podman run -ti ubuntu:19.04 bash
4026532245 mnt         1  5020 vagrant bash
4026532246 uts         1  5020 vagrant bash
4026532247 ipc         1  5020 vagrant bash
4026532248 pid         1  5020 vagrant bash
4026532250 net         1  5020 vagrant bash
As you can see the pid of the first podman call is shown nowhere by lsns. But the second podman call is in its own user and mount namespace. If you look closely you see that this is the only user namespace in the list and that the container (bash) is shown as running as vagrant, which means the user in the container is root mapped to vagrant. So the second podman is root inside that user namespace and mounts the filesystem for the container. The separate mount namespace for the container (bash) is needed to put the mounts of the podman command into a single block. Without it the program in the container could undo individual mounts, like the one top layer that turns an image into a container. Yes, the difference between an image and a container is simply a top layer, and everyone who is able to get around that top layer has write access to the image.
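You can double-check this without lsns by comparing the namespace links directly (a sketch; 4992 and 5020 are the pids from the listing above):
$ readlink /proc/4992/ns/user /proc/5020/ns/user
user:[4026532243]
user:[4026532243]
Both point at the same user namespace id that lsns reported, so the container really runs inside the user namespace created by the second podman process.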

Podman in Podman
To cut a long story short: it is not possible. Here are the error messages when run as root and as a normal user:
[root@37dbae636765 /]# podman run -ti ubuntu:19.04 bash
ERRO[0000] 'overlay' is not supported over <unknown> at "/var/lib/containers/storage/overlay" 
Error: error creating libpod runtime: kernel does not support overlay fs: 'overlay' is not
supported over <unknown> at "/var/lib/containers/storage/overlay": backing file system is unsupported for this graph driver


[mustermann@d455cb10fbe8 /]$ podman run -ti ubuntu:19.04 bash
Error: error creating libpod runtime: Error running podman info while refreshing state: cannot clone:
Operation not permitted
time="2019-04-25T18:37:35Z" level=error msg="cannot re-exec process" 
: exit status 1

Container per user
This one should be obvious but I'd like to mention it nevertheless. As there is no central daemon, each user has his own images and containers.
[vagrant@fedora29 ~]$ podman ps -a
CONTAINER ID  IMAGE                           COMMAND  CREATED      STATUS                     PORTS  NAMES
f99c3b2ebde7  docker.io/library/ubuntu:19.04  bash     3 hours ago  Exited (1) 12 seconds ago         epic_margulis
[vagrant@fedora29 ~]$ sudo su
[root@fedora29 vagrant]# podman ps -a
CONTAINER ID  IMAGE                           COMMAND  CREATED      STATUS                  PORTS  NAMES
8f470462811a  docker.io/library/ubuntu:19.04  bash     4 hours ago  Exited (1) 4 hours ago         hardcore_lichterman
Whether you see this as a waste of storage space and a hurdle because you can't access your colleagues' work, or whether you are happy that your users share nothing, is up to you.
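The separation is also visible in the storage paths; the fuse-overlayfs lowerdir shown earlier already pointed into the user's home directory (a sketch):
$ ls ~/.local/share/containers/storage    # a normal user's images and containers
# ls /var/lib/containers/storage          # root's images and containers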

IO-Performance
I have used FUSE before with ZFS and was a bit disappointed by its performance. So I've created some simple IO performance tests. For a start I've generated 1 GB of random data and then copied it from one file to another. First as a normal user with FUSE and then as root directly. Here are the results of
FUSE:
[vagrant@fedora29 fedora_user]$ podman run -ti fedora_user:latest bash
[mustermann@d455cb10fbe8 ~]$ date && dd if=/dev/urandom of=test bs=64k count=16k && date
Thu Apr 25 18:46:30 UTC 2019
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.24326 s, 130 MB/s
Thu Apr 25 18:46:39 UTC 2019
[mustermann@d455cb10fbe8 ~]$ date && dd if=test of=test2 && date
Thu Apr 25 18:47:10 UTC 2019
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 195.676 s, 5.5 MB/s
Thu Apr 25 18:50:26 UTC 2019
root:
[root@fedora29 fedora_user]# podman run -ti ubuntu:19.04 bash
root@46e7d09680db:/# date && dd if=/dev/urandom of=test bs=64k count=16k && date
Thu Apr 25 18:56:58 UTC 2019
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.09233 s, 176 MB/s
Thu Apr 25 18:57:04 UTC 2019
root@46e7d09680db:/# date && dd if=test of=test2 && date
Thu Apr 25 18:57:26 UTC 2019
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.6264 s, 162 MB/s
Thu Apr 25 18:57:33 UTC 2019
You can see a difference of more than an order of magnitude when copying an existing file (5.5 MB/s vs. 162 MB/s). Note that the copy runs with dd's default block size of 512 bytes (hence the 2097152 records), so it performs millions of small reads and writes, and with FUSE each of them has to take the detour through the userspace filesystem. On the other hand generating data from urandom with 64k blocks stayed within the same order of magnitude (130 MB/s vs. 176 MB/s) and is probably acceptable.

So I've decided to run a simple pgbench test to see how these mixed results affect databases.
FUSE:
[vagrant@fedora29 fedora_user]$ podman run -ti postgres:11.2 

[vagrant@fedora29 ~]$ podman exec -t b4c584e6ce92 bash
root@b4c584e6ce92:/# su postgres
postgres@b4c584e6ce92:/$ psql
postgres=# CREATE DATABASE example;
postgres@b4c584e6ce92:/$ pgbench -i -s 50 example
postgres@b4c584e6ce92:/$ pgbench -c 10 -t 10000 example
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 6.787 ms
tps = 1473.362351 (including connections establishing)
tps = 1473.460608 (excluding connections establishing)
root:
[root@fedora29 fedora_user]# podman run -ti postgres:11.2

[root@fedora29 vagrant]# podman exec -t  2e95611db63d bash
root@2e95611db63d:/# su postgres
postgres@2e95611db63d:/$ psql
postgres=# CREATE DATABASE example;
postgres@2e95611db63d:/$ pgbench -i -s 50 example
postgres@2e95611db63d:/$ pgbench -c 10 -t 10000 example
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 6.346 ms
tps = 1575.794323 (including connections establishing)
tps = 1575.887435 (excluding connections establishing)
Please note that the difference of roughly 100 tps (transactions per second) between the two runs is well within the error margin I've seen during several of these runs. So you can safely say that the performance is equal between FUSE and direct storage.

Final words
As I've shown, podman doesn't introduce some new magic tricks. It has to deal with the same namespace restrictions as Docker but chose other solutions. How you view these solutions is totally up to you. I think podman will at least find its niche in regulated environments where root is unwanted.