There are terminology issues here: in the Lemmy post title, in the article body, and in the article’s TL;DR. Basically, nothing is internally consistent except maybe the OCI Runtime spec itself, although how relevant its terminology is remains a separate issue.
Lemmy title: Containers are not Linux containers
Article title: What Is a Standard Container: Diving Into the OCI Runtime Spec
Both titles imply the existence of non-Linux containers, yet only the latter actually describes the contents of the article, specifically naming the “other” type of container: the “Standard Containers” defined by the OCI Runtime spec. As a title, I greatly prefer the latter; the former is unnecessarily antagonistic.
That aside, the article would really benefit from a central glossary section, since it refers to all of the following as containers without first establishing that each can validly be called a “container”:
- OCI-compliant containers
- Standard containers
- Linux containers
- Docker containers
- Kata VM-based containers
- Other VM-based containers that have been deprecated
If the goal was to distinguish what each of these means, the article doesn’t do a great job of it, other than to say “these exist and aren’t Linux containers, except Linux containers are obviously Linux containers”.
Reframing what I think the article tried to convey, while borrowing some terminology from C++/Python, the OCI Runtime specification defines an Abstract Base Class known as a Standard Container. A Standard Container supports the most minimal functions of starting and stopping an execution runtime. For Linux, FreeBSD, Kata, etc, those containers are subclasses of the Standard Container.
For the most part, unless your containerized application is purely computational and has zero dependencies upon the OS, your container will be one of the subclasses. There are essentially zero practical container images that can meet the zero-dependency requirements of being a Standard Container. So while it’s true that any runtime capable of running the container subclasses could also run a Standard Container, it is of little value in production. Hence why I assert that it’s an abstract base class: it cannot really be instantiated in real life.
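To make the analogy concrete, here is a minimal Python sketch. The class and method names (`StandardContainer`, `LinuxContainer`, `start`, `stop`) are made up for illustration and are not taken from the OCI spec or any real runtime; the point is only that the base class cannot be instantiated on its own, while an OS-specific subclass can.

```python
from abc import ABC, abstractmethod


class StandardContainer(ABC):
    """Hypothetical stand-in for the 'Standard Container' idea (not a real API)."""

    @abstractmethod
    def start(self) -> None:
        """Begin executing the container's workload."""

    @abstractmethod
    def stop(self) -> None:
        """Stop the container's workload."""


class LinuxContainer(StandardContainer):
    """Hypothetical subclass: a real one would set up namespaces, mounts, etc."""

    def __init__(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def stop(self) -> None:
        self.running = False


if __name__ == "__main__":
    try:
        StandardContainer()  # abstract: Python refuses to instantiate it
    except TypeError as err:
        print("cannot instantiate:", err)

    box = LinuxContainer()
    box.start()
    print("running:", box.running)
```

Anything that only needs `start()` and `stop()` can be written against the base class, but whatever actually gets run is always one of the subclasses.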
This is the reality of containers: none can abstract away an application’s dependency upon the OS. The container will still rely upon Win32 calls, POSIX calls, /proc, BSD sockets, or whatever else. So necessarily, all practical containers need a kernel layer. Even the case of Kata’s VM-based containers just means that the kernel is included within the container. Portability in this context just means that the kernel version can change beneath, but you cannot take a Linux container and run it on FreeBSD, not without shims and other runtime kludges.
Also, my understanding of containers on Linux (which could be wrong) was that containers are literally just processes running within specific namespaces and with a specific set of capabilities. When a container runs multiple processes, that is just multiple processes running on the host, all within the same set of namespaces.
In which case the title is just straight up wrong. Reading the article, they seem to be using a very tenuous technicality to justify the title, which is that a container is not a Linux process: because it can be multiple processes, or it could be running on an OS other than Linux.
Lame.
Your understanding is not wrong, but “within namespaces” is doing a lot of the heavy lifting. After all, there isn’t just one namespace but many simultaneous namespaces at play. A process namespace (a PID namespace, in kernel terms) is where process IDs (PIDs) begin from 1 and fork()'d processes are assigned incrementing PIDs. These values are meaningless outside of the namespace, and might even get mapped to different values in the parent namespace. A process namespace gives the appearance that the process with PID 1 is the init process, which is customarily the first userspace process started once the kernel is running.
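If you want to see the “many simultaneous namespaces” for yourself, Linux exposes them as symlinks under /proc/<pid>/ns, with targets that look like `pid:[4026531836]`. Below is a rough Python sketch that reads them; the helper names (`parse_ns_link`, `read_namespaces`, `shared_namespaces`) are my own, and it assumes a Linux host with /proc mounted (anywhere else it just returns an empty dict).

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into ('pid', 4026531836)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid="self"):
    """Return {entry_name: inode} for a process, or {} if it cannot be read."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = os.listdir(ns_dir)
    except OSError:
        return result  # not Linux, no /proc, or no such process
    for entry in entries:
        try:
            _kind, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except (OSError, ValueError):
            continue  # unreadable (permissions) or unexpected format
        result[entry] = inode
    return result


def shared_namespaces(a, b):
    """Names of namespaces that two processes have in common (same inode)."""
    return sorted(name for name in a if name in b and a[name] == b[name])


if __name__ == "__main__":
    for name, inode in sorted(read_namespaces().items()):
        print(f"{name:20} {inode}")
```

Two processes are in the same namespace of a given type when those inode numbers match, so running this on the host and again inside a container should show which namespaces differ.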
There are also network namespaces, where network interfaces (netif) can be switched (Layer 2) or routed (Layer 3), independent of what the global/default/parent network namespace is doing. This gives the appearance that all the network configuration is wholly independent, and allows neat things like crafting specialty routing (eg Kubernetes overlay networks).
Then there are user namespaces, where the root user has the appearance of total authority, and normal users can be created, but these are entirely distinct from the global/default/parent users and groups on the machine. This pairs well with filesystem namespaces (mount namespaces, in kernel terms), where a sub-tree of the real filesystem is treated as though it is a full tree, which allows the namespaced users to do standard manipulations like changing file ownership or permissions. This is essentially what UNIX chroot() does, but IIRC, chroot() did not also create user namespaces.
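The “root inside is not root outside” part can be sketched from the kernel’s ID map format. Each line of /proc/<pid>/uid_map has three numbers: the first ID inside the namespace, the ID it corresponds to outside, and the length of the range. The Python below is only an illustration: the function names are mine, and the `0 100000 65536` mapping is an example value I picked, not something from the article.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map text into (inside_start, outside_start, length) tuples."""
    mappings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(f) for f in fields)
        mappings.append((inside, outside, length))
    return mappings


def to_outside_id(inside_id, mappings):
    """Translate an ID seen inside the namespace to the ID seen outside, or None."""
    for inside, outside, length in mappings:
        if inside <= inside_id < inside + length:
            return outside + (inside_id - inside)
    return None  # unmapped: the ID has no meaning outside this namespace


if __name__ == "__main__":
    # Example mapping (an assumption for illustration): 65536 IDs starting at 100000.
    example = parse_id_map("0 100000 65536\n")
    for uid in (0, 1000, 70000):
        print(uid, "->", to_outside_id(uid, example))
```

With a mapping like that, UID 0 in the container is an unprivileged UID as far as the host is concerned, which is why chown and friends can be allowed inside without handing out real root.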
Taken together, namespaces in Linux are less about isolation – although they certainly work for that – and more about abstracting everything else in userspace away: no need to deal with other people’s processes, netifs, files. It’s like having the whole machine to yourself. In the history of computer science, isolation is often achieved precisely by making everything else invisible and out of the way. Virtual Memory did that, as did x86 Protected Mode, as did Virtual Machines. And so too does namespacing. Containers are the result of namespacing all the key kernel interfaces.
Perhaps the crucial thing then is what interfaces aren’t namespaced. In Linux, a big one is device drivers. Folks who want to share a USB TV capture card or a PCIe GPU or even a sub-NIC using SR-IOV will find that /dev files are not namespaced. They exist in the global space and aren’t isolated. So the only thing that can be done is to pretend to “move” the device file into a container, with everyone else promising not to try using that device anyway. This is not isolation, because accidental or malicious action will break it. To do “device isolation” would require every driver to be namespace-aware, so that it could treat requests from two different namespaces as distinct. That does not exist at all in Linux, and such low-level work continues to be difficult with containers, often surprising people who think that Linux containers are complete abstractions. They are not.
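One way to see why “moving” a device file isn’t isolation: a /dev entry is just a name attached to a (major, minor) number pair that the kernel uses to pick a driver. A rough Python sketch, assuming a Unix-like host (the helper name `device_numbers` is mine):

```python
import os
import stat


def device_numbers(path):
    """Return (major, minor) for a character/block device node, else None."""
    try:
        info = os.stat(path)
    except OSError:
        return None  # path missing or not accessible
    if not (stat.S_ISCHR(info.st_mode) or stat.S_ISBLK(info.st_mode)):
        return None  # regular file, directory, etc.
    return os.major(info.st_rdev), os.minor(info.st_rdev)


if __name__ == "__main__":
    for path in ("/dev/null", "/dev/zero", "/tmp"):
        print(path, device_numbers(path))
```

A node with the same numbers reaches the same driver whether it sits in the host’s /dev or in a container’s, so the container boundary only controls who can see the node, not what it talks to.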
thx. I just reposted from reddit



