To start with, I should answer why you would want a rootless container and why not just Docker.
Every user that can start Docker containers is effectively root on the machine.
You don't want this for at least two reasons:
1. Because of a misconfiguration or a security vulnerability, unwanted users may be allowed to start Docker containers. I've done some research about Docker vulnerabilities for a talk, and while there aren't many, you don't want to be the victim of the one that pops up every few years. The topic of misconfiguration is more important. This year a Docker playground got into the news for supposedly being hacked, while in reality it was a misconfiguration. When you read https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities you will notice that the process isolation for a privileged container is more a kind of request and not enforced like with a normal container. So you want to be very sure that the software in the container behaves nicely. The Docker playground did the exact opposite: it allowed people on the internet to run any program in it. So even people who want others to learn about Docker make mistakes. You shouldn't assume that you will do better.
2. There are regulated environments where developers should create containers (including testing their creations) but are under no circumstances allowed to get root. Such environments include banking, insurance, large government agencies and probably a lot more.
Why root for Docker
As mentioned at the beginning, I've found two issues that require root: the first concerns the mount namespace and the second the network namespace.
Mount namespace
The reason for requiring root has a name: FS_USERNS_MOUNT
This is the name of a flag that a filesystem must set, or the caller needs to be root on the host computer to mount it. The reason for this is the fear that an attacker could craft a malicious filesystem and attack the kernel with it. Some argue that this is already possible with flash drives, but with those you can only attack the PC in front of you. The fear with Linux-namespace-based attacks is that 10% of Amazon's servers get taken over and then attack the other 90%.
Some filesystems that set the flag: proc, sysfs, devpts, ramfs, cgroupfs, FUSE
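You can watch the flag in action with a small experiment: an unprivileged process that enters a fresh user and mount namespace may mount tmpfs (which sets FS_USERNS_MOUNT) but would get EPERM for something like ext4 at the same spot. Here is a Python sketch using ctypes; the helper name and the tmpfs-on-/mnt choice are mine, and the function simply reports False on machines where unprivileged user namespaces are disabled:

```python
import ctypes
import os

# Flag values from <linux/sched.h> and <sys/mount.h>.
CLONE_NEWUSER = 0x10000000
CLONE_NEWNS = 0x00020000
MS_REC = 0x4000
MS_PRIVATE = 1 << 18

libc = ctypes.CDLL(None, use_errno=True)


def tmpfs_mount_in_userns():
    """Try to mount tmpfs as an unprivileged user inside a fresh
    user+mount namespace. tmpfs sets FS_USERNS_MOUNT, so the kernel
    allows it; a filesystem without the flag would get EPERM at the
    same spot. Returns False when user namespaces are unavailable."""
    outer_uid = os.getuid()
    pid = os.fork()
    if pid == 0:  # child does the experiment and reports via exit code
        if libc.unshare(CLONE_NEWUSER | CLONE_NEWNS) != 0:
            os._exit(2)  # user namespaces disabled (e.g. by seccomp)
        try:
            with open("/proc/self/uid_map", "w") as f:
                f.write(f"0 {outer_uid} 1")  # become root in the new ns
        except OSError:
            os._exit(2)
        # Keep our mounts from propagating back to the host.
        libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None)
        rc = libc.mount(b"tmpfs", b"/mnt", b"tmpfs", 0, None)
        os._exit(0 if rc == 0 else 3)
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


print("tmpfs mountable in a user namespace:", tmpfs_mount_in_userns())
```

Because the child process exits immediately, both namespaces and the mount disappear again without leaving a trace on the host.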
Network namespace
A new network namespace is like a computer without connectivity (no network card, no WiFi, ...). You may have a use for that, but definitely not for a server.
Normally you would use a virtual ethernet card (veth) to connect the namespace. However, veth devices are always created as a pair, connected by a virtual ethernet cable that cannot be unplugged. As a pair of network cards in the same namespace wouldn't solve the problem, you need a way to move one card into another network namespace. There is a way, but it requires root on the host computer.
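To see what that looks like in practice, here is roughly the root-only veth setup expressed with iproute2. The namespace and interface names are made up, the helper is mine, and the commands are only executed when you actually are root; otherwise they are just printed:

```python
import os
import subprocess


def veth_setup_commands(ns, host_if, ns_if):
    """The iproute2 calls needed to wire up a network namespace.
    Every one of them requires root on the host, which is the
    rootless problem in a nutshell."""
    return [
        ["ip", "netns", "add", ns],
        ["ip", "link", "add", host_if, "type", "veth", "peer", "name", ns_if],
        # The crucial step: moving one end of the pair into the namespace.
        ["ip", "link", "set", ns_if, "netns", ns],
    ]


cmds = veth_setup_commands("demo", "veth-host", "veth-demo")
if os.geteuid() == 0:
    for cmd in cmds:
        subprocess.run(cmd, check=True)
    subprocess.run(["ip", "netns", "del", "demo"], check=True)  # clean up
else:
    print("not root; would run:")
    for cmd in cmds:
        print(" ", " ".join(cmd))
```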
Installation
The best way to get to know podman is to install it.
Fedora 29 - x86_64 19 kB/s | 26 kB 00:01
Dependencies resolved.
================================================================================
Package Arch Version Repo Size
================================================================================
Installing:
podman x86_64 1:1.2.0-2.git3bd528e.fc29 updates 10 M
Installing dependencies:
containernetworking-plugins
x86_64 0.7.4-2.fc29 updates 13 M
containers-common x86_64 1:0.1.35-2.git404c5bd.fc29 updates 31 k
fuse3-libs x86_64 3.4.2-2.fc29 updates 82 k
ostree-libs x86_64 2019.1-3.fc29 updates 363 k
runc x86_64 2:1.0.0-85.dev.gitdd22a84.fc29 updates 2.3 M
libnet x86_64 1.1.6-16.fc29 fedora 62 k
protobuf-c x86_64 1.3.0-5.fc29 fedora 33 k
Installing weak dependencies:
container-selinux noarch 2:2.95-1.gite3ebc68.fc29 updates 46 k
criu x86_64 3.11-1.fc29 updates 487 k
fuse-overlayfs x86_64 0.3-8.dev.gita6958ce.fc29 updates 49 k
slirp4netns x86_64 0.3-0.alpha.2.git30883b5.fc29 updates 71 k
Transaction Summary
================================================================================
Install  12 Packages
If you look closely you will see the two packages containing the name fuse. FUSE, as in the filesystem that sets the required FS_USERNS_MOUNT flag. Oh, and overlayfs sounds like the overlay mechanism needed for Docker container layers.
Another package contains the short name for network namespace (netns).
FUSE
FUSE is short for Filesystem in Userspace. Traditionally a filesystem is part of the kernel, as a filesystem translates between the files and folders you see and the byte patterns used to store them on a hard drive. This hardware management is what the kernel does, and in order to do it the kernel needs full access to all hardware. That means that everything that runs in kernel space has full access to everything on your computer.
But at some point people wanted to use more than a hard drive to store data. One example I remember was abusing Gmail to store data (yes, in Gmail, not Drive). To keep the kernel small and secure, the FUSE interface was created so that programs in userspace can provide the data. Unfortunately this solution has a cost. Not in money, but in runtime performance: each switch between user- and kernelspace costs something, because passing security gates is never free.
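You can get a feeling for that cost without FUSE at all: the sketch below (helper name is mine) reads the same amount of data from /dev/zero once with a single syscall and once with thousands of tiny ones, so the only difference is the number of user/kernel transitions:

```python
import os
import time


def read_bytes(path, total, chunk):
    """Read `total` bytes from `path` in pieces of `chunk` bytes.
    Each os.read() is one round trip into the kernel."""
    fd = os.open(path, os.O_RDONLY)
    got = 0
    try:
        while got < total:
            got += len(os.read(fd, min(chunk, total - got)))
    finally:
        os.close(fd)
    return got


TOTAL = 1 << 20  # 1 MiB from the kernel's zero device

start = time.perf_counter()
read_bytes("/dev/zero", TOTAL, TOTAL)  # one syscall
one_call = time.perf_counter() - start

start = time.perf_counter()
read_bytes("/dev/zero", TOTAL, 64)     # 16384 syscalls
many_calls = time.perf_counter() - start

print(f"1 syscall: {one_call:.6f}s, 16384 syscalls: {many_calls:.6f}s")
```

FUSE pays this kind of transition twice per operation (application to kernel, kernel to the userspace filesystem daemon), which is where the performance gap in the measurements later in this post comes from.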
Running podman
Before investigating the network, let's start podman.
First as root
root 2908 0.1 3.0 922732 62576 pts/0 Sl+ 02:39 0:00 podman run -ti ubuntu:19.04 bash
root 2963 0.0 0.0 77848 1980 ? Ssl 02:39 0:00 /usr/libexec/podman/conmon -s -c
8f470462811ad66e16aa7d2daf2733487f971d33320788dc2669ffa1a5cdd59c -u 8
root 2975 0.0 0.1 4176 3368 pts/0 Ss+ 02:39 0:00 bash
You see the podman command itself, as there is no daemon to do the job. Then there is the application in the container: bash. Additionally there is the container monitor conmon. If you ever have to name software, please do me a favor and don't pick something that looks like a misspelling of a common English word. I had a hard time googling for it.

Now as a normal user
vagrant 3413 0.0 2.9 775184 59276 pts/0 Sl+ 02:44 0:00 podman run -ti ubuntu:19.04 bash
vagrant 3445 2.8 4.2 923160 85684 pts/0 Sl+ 02:44 0:05 podman run -ti ubuntu:19.04 bash
vagrant 3487 1.3 0.1 5324 3196 ? Ss 02:44 0:02 /usr/bin/fuse-overlayfs -o
lowerdir=/home/vagrant/.local/share/containers/storage/overlay/l/XZ3SMZKBOP
vagrant 3490 0.0 0.0 77848 1944 ? Ssl 02:44 0:00 /usr/libexec/podman/conmon -c
f99c3b2ebde7c7e503193c82adff4a5b893f80f98607496f303cb7c71e5f8e79 -u f99c
vagrant 3500 0.0 0.0 4164 756 pts/0 Ss 02:44 0:00 bash
vagrant 3506 0.2 0.0 3276 1928 pts/0 S 02:44 0:00 /usr/bin/slirp4netns --disable-host-loopback
--mtu 65520 -c -e 3 -r 4 3500 tap0
Although you see two podman instances, I've started it only once. The second one was started automatically. You now also see the fuse-overlayfs program providing the filesystem data and the slirp4netns program mentioned earlier. If you look closely at the program arguments of slirp4netns you will see tap0 at the end, and before it the number 3500. This is the pid of the bash process that is the main program in the container.

Two instances of podman as normal user
vagrant 4876 0.0 0.2 224992 4352 pts/2 Ss 08:23 0:00 /bin/bash
vagrant 4898 0.1 2.8 775100 57432 pts/2 Sl+ 08:23 0:00 podman start -i -a f99c3b2ebde7
vagrant 4904 0.4 3.0 849000 61296 pts/2 Sl+ 08:23 0:00 podman start -i -a f99c3b2ebde7
vagrant 4918 0.0 0.1 4128 2696 ? Ss 08:23 0:00 /usr/bin/fuse-overlayfs -o lowerdir=/home/[...]
vagrant 4921 0.0 0.0 77848 1940 ? Ssl 08:23 0:00 /usr/libexec/podman/conmon -c f99c3b2ebde7c7e50[...]
vagrant 4933 0.1 0.1 4052 2712 pts/0 Ss+ 08:23 0:00 bash
vagrant 4939 0.0 0.0 2572 756 pts/2 S 08:23 0:00 /usr/bin/slirp4netns --disable-host-loopback
--mtu 65520 -c -e 3 -r 4 4933 tap0
vagrant 4949 0.0 0.2 224992 4400 pts/3 Ss 08:23 0:00 /bin/bash
vagrant 4986 0.5 2.8 775184 58788 pts/3 Sl+ 08:24 0:00 podman run -ti ubuntu:19.04 bash
vagrant 4992 0.9 3.0 775268 61952 pts/3 Sl+ 08:24 0:00 podman run -ti ubuntu:19.04 bash
vagrant 5006 0.0 0.1 4436 2952 ? Ss 08:24 0:00 /usr/bin/fuse-overlayfs -o lowerdir=/home/[...]
vagrant 5009 0.0 0.0 77848 1896 ? Ssl 08:24 0:00 /usr/libexec/podman/conmon -c cc01d0d9983e[...]
vagrant 5020 0.2 0.1 4052 2532 pts/0 Ss+ 08:24 0:00 bash
vagrant 5026 0.0 0.0 2572 816 pts/3 S 08:24 0:00 /usr/bin/slirp4netns --disable-host-loopback
--mtu 65520 -c -e 3 -r 4 5020 tap0
The important part here is that everything is doubled, as there is no central daemon that could be reused by the second container. This also means that the two instances are not connected through any single program, which means better isolation.

The network
Now that podman is running, let's take a look at the network.
root@f99c3b2ebde7:/# ip a
1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tap0: mtu 65520 qdisc fq_codel state UNKNOWN group default qlen 1000
link/ether 6a:26:66:d1:b8:1c brd ff:ff:ff:ff:ff:ff
inet 10.0.2.100/24 brd 10.0.2.255 scope global tap0
valid_lft forever preferred_lft forever
inet6 fe80::6826:66ff:fed1:b81c/64 scope link
valid_lft forever preferred_lft forever
You see the tap0 from slirp4netns again. So what is this tap? Well, https://en.wikipedia.org/wiki/TUN/TAP has a nice definition:

TAP (namely network tap) simulates a link layer device and it operates with layer 2 packets like Ethernet frames. Packets sent by an operating system via a TUN/TAP device are delivered to a user-space program [...]
So instead of two virtual network cards like with veth, you have only one virtual network card and the other end is a user-space program. In order to understand that user-space program I've searched around and found the relevant code in https://github.com/rootless-containers/slirp4netns/blob/master/main.c. Here are the code parts needed to understand it:
child:
if ((rc = nsenter(target_pid)) < 0) {
if ((tapfd = open_tap(tapname)) < 0) {
if (sendfd(sock, tapfd) < 0) {
parent:
waitpid(child_pid, &child_wstatus, 0);
if ((rc = do_slirp(tapfd, exit_fd, api_socket, cfg)) < 0) {
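The sendfd()/recvfd() pair in the excerpt boils down to SCM_RIGHTS file-descriptor passing over a Unix socket: the kernel duplicates an open descriptor into the receiving process. Here is a minimal Python sketch of the same trick (a pipe stands in for the tap device, the function name is mine, and socket.send_fds/recv_fds need Python 3.9+):

```python
import os
import socket


def pass_fd_demo():
    """Mimic slirp4netns's sendfd()/recvfd(): ship an open file
    descriptor from a child process to its parent via SCM_RIGHTS
    on a Unix socketpair. A pipe write end stands in for the tap fd."""
    parent_sock, child_sock = socket.socketpair()
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: the process that opened the fd
        parent_sock.close()
        socket.send_fds(child_sock, [b"x"], [w])  # SCM_RIGHTS under the hood
        os._exit(0)
    child_sock.close()
    os.close(w)  # drop our own copy; the passed one survives independently
    _, fds, _, _ = socket.recv_fds(parent_sock, 1, 1)
    os.waitpid(pid, 0)
    os.write(fds[0], b"written through a passed fd")
    os.close(fds[0])
    data = os.read(r, 100).decode()
    os.close(r)
    parent_sock.close()
    return data


print(pass_fd_demo())
```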
As you can see, slirp4netns forks itself and the child process enters the container's namespaces; hence it requires the pid of the container process. There it opens the tap device and sends the opened file descriptor back (yes, under Linux even an open network device is referenced by a file descriptor, as everything is a file under Unix and Linux has inherited this from Unix). The parent waits until it gets the file descriptor for the tap device. The handling of the network traffic is then done by the slirp library.

Please note that when you start podman as a normal user you can't bind ports below 1024. This should be nothing new, as a normal user can never use ports below 1024, but I want to highlight it because the error message you get when you try looks like this:
[vagrant@fedora29 ~]$ podman run -p 443:8080 -ti ubuntu:19.04 bash
Error: error from slirp4netns while setting up port redirection:
map[desc:bad request: add_hostfwd: slirp_add_hostfwd failed]
Not quite helpful for finding the real reason. Oh, and for everyone wondering whether 1024 itself is allowed, here is the answer:
[vagrant@fedora29 ~]$ podman run -p 1024:8080 -ti ubuntu:19.04 bash
root@f42ee6a2e7e0:/# exit
Yes it is.

Normal user in a container
As I was playing around I created a container where the main program was run by a normal user. Then I saw these file permissions:
mustermann@e476b77c25cc:~$ id
uid=1000(mustermann) gid=1000(mustermann) groups=1000(mustermann)
mustermann@e476b77c25cc:~$ ls -la
total 96
drwxr-xr-x. 2 mustermann mustermann 59 Apr 22 10:52 .
drwxr-xr-x. 3 root root 24 Apr 20 20:17 ..
-rw-------. 1 mustermann mustermann 4899 Apr 22 10:52 .bash_history
-rw-r--r--. 1 mustermann mustermann 220 Jan 24 10:22 .bash_logout
mustermann@e476b77c25cc:~$ ls -l /
drwxr-xr-x. 2 root root 6 Mar 10 05:23 opt
dr-xr-xr-x. 128 nobody nogroup 0 Apr 22 10:53 proc
drwx------. 2 root root 37 Mar 10 05:24 root
You see that there are files belonging to the normal user (mustermann) and files belonging to root. That means you have at least two users in the container. That shouldn't be possible, as an unprivileged user can only map his own id into a child user namespace. For files that are dynamically generated and typically belong to root, like /proc, the owner is mapped to the overflow user nobody, as expected.

The process list turned out to be very helpful:
vagrant 15914 0.0 2.9 775184 59244 pts/4 Sl+ 13:17 0:00 podman run -ti 0e6953c6397f bash
vagrant 15920 0.0 3.0 849000 61984 pts/4 Sl+ 13:17 0:00 podman run -ti 0e6953c6397f bash
vagrant 15935 0.0 0.0 4116 1572 ? Ss 13:17 0:00 /usr/bin/fuse-overlayfs -o lowerdir=/home/[...]
vagrant 15938 0.0 0.0 77848 2024 ? Ssl 13:17 0:00 /usr/libexec/podman/conmon -c e476b77c25cc94a22[...]
100999 15949 0.0 0.1 4180 2320 pts/0 Ss+ 13:17 0:00 bash
vagrant 15955 0.0 0.0 2704 1920 pts/4 S 13:17 0:00 /usr/bin/slirp4netns --disable-host-loopback
--mtu 65520 -c -e 3 -r 4 15949 tap0
You can see that the container process suddenly runs under a different uid. The reason for this has a name:
/etc/subuid
and a content
vagrant:100000:65536
With this file you can have subordinate user ids. According to the entry, root should map to 100000 and mustermann to 101000, but it turns out that container root is mapped to vagrant and every other uid is shifted minus one: container uid N becomes 100000 + N - 1, so mustermann (1000) shows up as 100999. While this is pretty cool and explains why you can have multiple users inside a podman container, it also requires some root involvement on the host, because without root you won't get an entry into /etc/subuid.
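The observed mapping can be written down as a tiny Python helper. The function name and the defaults (owner uid 1000 for vagrant, subuid range starting at 100000) are mine, taken from this machine's /etc/subuid:

```python
def host_uid(container_uid, owner_uid=1000, subuid_start=100000):
    """Map a container uid to a host uid the way rootless podman builds
    its user namespace from /etc/subuid ("vagrant:100000:65536"):
    container root becomes the invoking user, and container uid N
    (N >= 1) becomes subuid_start + N - 1."""
    if container_uid == 0:
        return owner_uid
    return subuid_start + container_uid - 1


# root in the container is vagrant (uid 1000) on the host ...
print(host_uid(0))     # 1000
# ... and mustermann (uid 1000 in the container) is host uid 100999,
# matching the ps output above.
print(host_uid(1000))  # 100999
```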
lsns
During my research I found a tool called lsns, which is short for "list namespaces". Calling it is simple, but to get some insight you first need the process ids of the container programs.
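(lsns gets its namespace identifiers from the symlinks under /proc/<pid>/ns, which you can read yourself; a quick Python sketch of the same lookup, helper name mine:)

```python
import os


def namespaces(pid="self"):
    """Return the "type:[inode]" namespace identifiers for a process,
    exactly the values lsns groups processes by."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}


for name, ident in namespaces().items():
    print(f"{name:10} {ident}")
```

With those identifiers in mind, here are the process ids of the container programs: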
vagrant 4986 0.0 2.8 922648 57368 pts/3 Sl+ 05:19 0:03 podman run -ti ubuntu:19.04 bash
vagrant 4992 0.0 2.9 922732 60128 pts/3 Sl+ 05:19 0:03 podman run -ti ubuntu:19.04 bash
vagrant 5020 0.0 0.0 4164 1592 pts/0 Ss+ 05:19 0:00 bash
Call to lsns:

[vagrant@fedora29 ~]$ lsns
NS TYPE NPROCS PID USER COMMAND
4026532243 user 5 4992 vagrant podman run -ti ubuntu:19.04 bash
4026532244 mnt 4 4992 vagrant podman run -ti ubuntu:19.04 bash
4026532245 mnt 1 5020 vagrant bash
4026532246 uts 1 5020 vagrant bash
4026532247 ipc 1 5020 vagrant bash
4026532248 pid 1 5020 vagrant bash
4026532250 net 1 5020 vagrant bash
As you can see, the pid of the first podman call doesn't show up in the lsns output at all, but the second podman call is in its own user and mount namespace. If you look closely you see that this is the only user namespace in the list and that the container process (bash) is running as vagrant, which means the user inside the container is root (container root maps to vagrant). So the second podman is root inside that user namespace and mounts the filesystem for the container. The separate mount namespace for the container (bash) is needed to bundle the mounts of the podman command into a single block. Without it, the program in the container could undo individual mounts, like the one top layer that turns an image into a container. Yes, the difference between an image and a container is simply a top layer, and anyone able to get around that top layer has write access to the image.

Podman in Podman
To cut a long story short: it is not possible. Here are the error messages when run as root and as a normal user:
[root@37dbae636765 /]# podman run -ti ubuntu:19.04 bash
ERRO[0000] 'overlay' is not supported over at "/var/lib/containers/storage/overlay"
Error: error creating libpod runtime: kernel does not support overlay fs: 'overlay' is not
supported over at "/var/lib/containers/storage/overlay": backing file system is unsupported for this graph driver
[mustermann@d455cb10fbe8 /]$ podman run -ti ubuntu:19.04 bash
Error: error creating libpod runtime: Error running podman info while refreshing state: cannot clone:
Operation not permitted
time="2019-04-25T18:37:35Z" level=error msg="cannot re-exec process"
: exit status 1
Container per user
This one should be obvious but I like to mention it nevertheless. As there is no central daemon each user has its own images and containers.
[vagrant@fedora29 ~]$ podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f99c3b2ebde7 docker.io/library/ubuntu:19.04 bash 3 hours ago Exited (1) 12 seconds ago epic_margulis
[vagrant@fedora29 ~]$ sudo su
[root@fedora29 vagrant]# podman ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8f470462811a docker.io/library/ubuntu:19.04 bash 4 hours ago Exited (1) 4 hours ago hardcore_lichterman
Whether you see this as a waste of storage space and a hurdle because you can't access your colleagues' work, or whether you are happy that your users share nothing, is up to you.

IO-Performance
I have used FUSE before with ZFS and was a bit disappointed by its performance. So I've created some simple IO performance tests. For a start I generated 1GB of random data and copied it from one file to another. First as a normal user with FUSE and then as root directly. Here are the results.
FUSE:
[vagrant@fedora29 fedora_user]$ podman run -ti fedora_user:latest bash
[mustermann@d455cb10fbe8 ~]$ date && dd if=/dev/urandom of=test bs=64k count=16k && date
Thu Apr 25 18:46:30 UTC 2019
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 8.24326 s, 130 MB/s
Thu Apr 25 18:46:39 UTC 2019
[mustermann@d455cb10fbe8 ~]$ date && dd if=test of=test2 && date
Thu Apr 25 18:47:10 UTC 2019
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 195.676 s, 5.5 MB/s
Thu Apr 25 18:50:26 UTC 2019
root:

[root@fedora29 fedora_user]# podman run -ti ubuntu:19.04 bash
root@46e7d09680db:/# date && dd if=/dev/urandom of=test bs=64k count=16k && date
Thu Apr 25 18:56:58 UTC 2019
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.09233 s, 176 MB/s
Thu Apr 25 18:57:04 UTC 2019
root@46e7d09680db:/# date && dd if=test of=test2 && date
Thu Apr 25 18:57:26 UTC 2019
2097152+0 records in
2097152+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 6.6264 s, 162 MB/s
Thu Apr 25 18:57:33 UTC 2019
You can see a difference of orders of magnitude when copying an existing file. On the other hand, reading from urandom stayed within the same order of magnitude and is probably acceptable. So I decided to run a simple pgbench test to see how these mixed results affect databases.
FUSE:
[vagrant@fedora29 fedora_user]$ podman run -ti postgres:11.2
[vagrant@fedora29 ~]$ podman exec -t b4c584e6ce92 bash
root@b4c584e6ce92:/# su postgres
postgres@b4c584e6ce92:/$ psql
postgres=# CREATE DATABASE example;
postgres@b4c584e6ce92:/$ pgbench -i -s 50 example
postgres@b4c584e6ce92:/$ pgbench -c 10 -t 10000 example
starting vacuum...end.
transaction type:
scaling factor: 50
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 6.787 ms
tps = 1473.362351 (including connections establishing)
tps = 1473.460608 (excluding connections establishing)
root:

[root@fedora29 fedora_user]# podman run -ti postgres:11.2
[root@fedora29 vagrant]# podman exec -t 2e95611db63d bash
root@2e95611db63d:/# su postgres
postgres@2e95611db63d:/$ psql
postgres=# CREATE DATABASE example;
postgres@2e95611db63d:/$ pgbench -i -s 50 example
postgres@2e95611db63d:/$ pgbench -c 10 -t 10000 example
starting vacuum...end.
transaction type:
scaling factor: 50
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 10000
number of transactions actually processed: 100000/100000
latency average = 6.346 ms
tps = 1575.794323 (including connections establishing)
tps = 1575.887435 (excluding connections establishing)
Please note that the difference of about 100 tps (transactions per second) between the two runs is well within the error margin I've seen across several of these runs. So you can safely say that the performance is equal between FUSE and direct storage.

Final words
As I've shown, podman doesn't introduce new magic tricks. It has to deal with the same namespace restrictions as Docker, but chose different solutions. How you view these solutions is totally up to you. I think podman will find at least its niche in regulated environments where root is unwanted.