Playing with Containers (and Rust)

Oct 9 2021 software linux

Synopsis

In my free time, I started working on my own solution for container and VM management. The motivation was primarily to learn, but I decided it could also serve as a tool for others.

I’ve had issues with the existing options: Firejail’s configuration is not programmable, Guix lacks features I want, and QubesOS cannot run on all hardware. I think the best solution would not only come with pre-configured settings, but also let users program their desired behaviour. People also shouldn’t be forced to install a different operating system, so it should integrate with any GNU/Linux system.

And so I started working on Moksha [1]. ^^

Namespaces

Containers in the spirit of Docker make use of Linux namespaces. I highly recommend the LWN article linked at [2] as a brief introduction.

At the kernel level, namespaces isolate certain kernel resources from one another, such as users, PIDs, the network, mounts, etc.

If namespace support is enabled when you compile your kernel, every process already runs in the default (initial) namespaces. New namespaces are created and entered through syscalls such as ‘unshare’ and ‘setns’.
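To see which namespaces a process currently belongs to, one can read the symlinks under /proc/self/ns. The following is a minimal sketch, assuming Linux with procfs mounted at /proc; the names parse_ns_link and current_namespaces are made up for illustration and are not taken from Moksha.

```rust
use std::fs;
use std::io;

/// Parse a namespace symlink target such as "mnt:[4026531840]"
/// into its kind ("mnt") and its inode number.
fn parse_ns_link(target: &str) -> Option<(String, u64)> {
    let (kind, rest) = target.split_once(":[")?;
    let inode = rest.strip_suffix(']')?.parse::<u64>().ok()?;
    Some((kind.to_string(), inode))
}

/// List the namespaces of the calling process by reading /proc/self/ns.
/// Entries whose link target does not match the expected shape are skipped.
fn current_namespaces() -> io::Result<Vec<(String, u64)>> {
    let mut found = Vec::new();
    for entry in fs::read_dir("/proc/self/ns")? {
        let target = fs::read_link(entry?.path())?;
        if let Some(ns) = parse_ns_link(&target.to_string_lossy()) {
            found.push(ns);
        }
    }
    Ok(found)
}
```

Two processes share a namespace when the kind and inode number of their links match, so comparing this list before and after an ‘unshare’ call shows which namespaces actually changed.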

There are already command-line tools that wrap these syscalls and let you play with the concept: unshare and nsenter.

Trying to build my own containers

First of all, I used the /proc directory to gather the required information about the system (mounts, processes, namespaces, etc).
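As an illustration of gathering mount information from /proc, here is a small sketch that parses /proc/self/mounts. It assumes the usual whitespace-separated layout (source, mount point, filesystem type, options, ...) and ignores the octal escapes the kernel uses for special characters in paths. MountEntry, parse_mounts and read_mounts are hypothetical names, not Moksha's.

```rust
use std::fs;
use std::io;

/// The first three fields of a line in /proc/self/mounts.
#[derive(Debug, PartialEq)]
struct MountEntry {
    source: String,
    target: String,
    fstype: String,
}

/// Parse the text of a mounts file; lines with fewer than three fields are skipped.
fn parse_mounts(contents: &str) -> Vec<MountEntry> {
    contents
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            Some(MountEntry {
                source: fields.next()?.to_string(),
                target: fields.next()?.to_string(),
                fstype: fields.next()?.to_string(),
            })
        })
        .collect()
}

/// Read and parse the mount table of the calling process.
fn read_mounts() -> io::Result<Vec<MountEntry>> {
    Ok(parse_mounts(&fs::read_to_string("/proc/self/mounts")?))
}
```

Keeping the parser separate from the file read makes it easy to try on a pasted mount table without touching the running system.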

To choose which namespaces to separate for a process, you combine the corresponding CLONE_NEW* flags with a bitwise OR (|); the flags are defined in the kernel source. Check [3] for an example. Then, I forked the process.
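The flag handling can be sketched as follows. This is not the code at [3], just a minimal standalone version: it declares the C library's unshare function by hand instead of using a bindings crate, and the helper names combine and unshare_namespaces are assumptions for illustration. The constants are the values from the kernel's include/uapi/linux/sched.h.

```rust
use std::io;
use std::os::raw::c_int;

// Namespace flags, as defined in include/uapi/linux/sched.h.
const CLONE_NEWNS: c_int = 0x0002_0000; // mount namespace
const CLONE_NEWUTS: c_int = 0x0400_0000; // hostname and domain name
const CLONE_NEWIPC: c_int = 0x0800_0000; // System V IPC, POSIX message queues
const CLONE_NEWUSER: c_int = 0x1000_0000; // user and group IDs
const CLONE_NEWPID: c_int = 0x2000_0000; // process IDs
const CLONE_NEWNET: c_int = 0x4000_0000; // network stack

extern "C" {
    // Provided by the C library that Rust's std already links against on Linux.
    fn unshare(flags: c_int) -> c_int;
}

/// Combine individual namespace flags into one bit mask with bitwise OR.
fn combine(flags: &[c_int]) -> c_int {
    flags.iter().fold(0, |mask, flag| mask | flag)
}

/// Move the calling process into fresh namespaces for the given flags.
fn unshare_namespaces(flags: &[c_int]) -> io::Result<()> {
    // SAFETY: unshare takes a plain integer and does not touch Rust-managed memory.
    if unsafe { unshare(combine(flags)) } == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}
```

A call such as unshare_namespaces(&[CLONE_NEWNS, CLONE_NEWPID]) usually needs root or a user namespace. Note that CLONE_NEWPID does not move the caller itself: only children created afterwards land in the new PID namespace, which is one reason a fork follows the unshare.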

One of the most important parts is setting up the mounts. My approach was to unmount everything possible, leaving ‘procfs’ for the end, because the other mounts could not be unmounted without it. Of course, some mounts depend on each other, which is why I iterated over all of them a couple of times, at [4].
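The repeated sweeps can be sketched independently of the real syscall. In the version below the actual unmount is passed in as a closure (in a real tool it would wrap something like umount2), so the retry logic can be tried without privileges. This is an assumed reconstruction of the idea, not the code at [4], and unmount_in_passes is a hypothetical name.

```rust
/// Try to unmount every entry in `mounts`, sweeping the list up to
/// `max_passes` times so that nested mounts get another chance once
/// their children are gone. `proc_path` is skipped during the sweeps
/// and attempted last. `try_unmount` returns true on success.
/// Returns the mount points that could not be unmounted.
fn unmount_in_passes<F>(
    mounts: &[String],
    proc_path: &str,
    max_passes: usize,
    mut try_unmount: F,
) -> Vec<String>
where
    F: FnMut(&str) -> bool,
{
    // Everything except procfs takes part in the sweeps.
    let mut pending: Vec<String> = mounts
        .iter()
        .filter(|m| m.as_str() != proc_path)
        .cloned()
        .collect();

    for _ in 0..max_passes {
        if pending.is_empty() {
            break;
        }
        let before = pending.len();
        // Keep only the entries whose unmount attempt failed.
        pending.retain(|m| !try_unmount(m.as_str()));
        if pending.len() == before {
            break; // no progress in this pass, further passes will not help
        }
    }

    // procfs goes last, if it was in the list at all.
    if mounts.iter().any(|m| m == proc_path) && !try_unmount(proc_path) {
        pending.push(proc_path.to_string());
    }
    pending
}
```

With the list /proc, /a, /a/b and an unmount that fails while a child mount exists, the first pass removes /a/b, the second removes /a, and /proc is handled at the end.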

But now the issue is that you still have your root filesystem mounted! The way to deal with this is the pivot_root syscall. It takes two arguments: the path of the new root, and a path where the old root should be put. The function is implemented here [5].
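For reference, a standalone sketch of calling pivot_root from Rust without a bindings crate. It is not the implementation at [5]: it goes through the C library's generic syscall function, hard-codes the syscall number for x86_64 and aarch64 only, and the helpers old_root_dir and to_cstring (as well as the directory name .old_root) are assumptions for illustration.

```rust
use std::ffi::CString;
use std::io;
use std::os::raw::c_long;
use std::os::unix::ffi::OsStrExt;
use std::path::{Path, PathBuf};

// Syscall numbers are architecture specific; only two are covered here.
#[cfg(target_arch = "x86_64")]
const SYS_PIVOT_ROOT: c_long = 155;
#[cfg(target_arch = "aarch64")]
const SYS_PIVOT_ROOT: c_long = 41;

extern "C" {
    // Generic syscall entry point from the C library.
    fn syscall(number: c_long, ...) -> c_long;
}

/// Directory underneath the new root where the old root will be parked.
fn old_root_dir(new_root: &Path) -> PathBuf {
    new_root.join(".old_root")
}

/// Convert a path to a NUL-terminated C string, rejecting embedded NUL bytes.
fn to_cstring(path: &Path) -> io::Result<CString> {
    CString::new(path.as_os_str().as_bytes())
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidInput, e))
}

/// Make `new_root` the root filesystem and move the old root to `put_old`.
fn pivot_root(new_root: &Path, put_old: &Path) -> io::Result<()> {
    let new_root = to_cstring(new_root)?;
    let put_old = to_cstring(put_old)?;
    // SAFETY: both pointers refer to valid NUL-terminated strings that outlive the call.
    let rc = unsafe { syscall(SYS_PIVOT_ROOT, new_root.as_ptr(), put_old.as_ptr()) };
    if rc == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}
```

According to the pivot_root(2) manual page, the new root must be a mount point and the old-root directory must exist at or underneath it, which is why old_root_dir builds the path inside the new root. A typical call would be pivot_root(root, &old_root_dir(root)) followed by a change of directory to "/".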

Of course the last step has to be done before setting up the mounts. One of the things left to do is to provide different types of filesystems, confined to the container. Either way, afterwards the old root can be unmounted safely.

I guess I’m now at the point where the parent process should be able to die, the stdout of the ‘init’ process should be redirected somewhere else, and the whole state should be managed.

Appendix