How Docker primitives secure container environments

March 16, 2021 by Srinivas

Docker makes heavy use of Linux kernel features, and namespaces and cgroups are among the most fundamental ones that containers rely on. This article provides an overview of the Docker security primitives that can be leveraged when running containers.

We will discuss namespaces, cgroups, capabilities, seccomp profiles and AppArmor profiles.

Namespaces  

One of the primary concerns when using containers is isolation between the containers and host as well as the isolation among different containers. Imagine that we spin up two containers with different sets of features and there is no need for each container process to know what's running on the other container.

Similarly, consider a scenario in which three Apache web servers run in three different containers. All three containers need to start Apache on port 80, and the host machine should also be able to use port 80 for another service.

Containers address these concerns with a Linux kernel feature called namespaces. Namespaces partition kernel resources so that one set of processes sees one set of resources while another set of processes sees a different set. Docker uses namespaces to isolate containers from the host and from each other.

The command lsns on a Docker host shows the list of namespaces being used by Docker.

$ lsns
        NS TYPE   NPROCS   PID USER   COMMAND
4026531835 cgroup     73  1516 docker /lib/systemd/systemd --user
4026531836 pid        73  1516 docker /lib/systemd/systemd --user
4026531837 user       73  1516 docker /lib/systemd/systemd --user
4026531838 uts        73  1516 docker /lib/systemd/systemd --user
4026531839 ipc        73  1516 docker /lib/systemd/systemd --user
4026531840 mnt        73  1516 docker /lib/systemd/systemd --user
4026531992 net        73  1516 docker /lib/systemd/systemd --user
$

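As the preceding excerpt shows, seven namespace types are in use. Besides the cgroup namespace, which isolates a process's view of the cgroup hierarchy, the Docker engine uses the following six: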

  1. PID namespace for process isolation.
  2. USER namespace for the user privilege isolation.
  3. UTS namespace for isolating the hostname and domain name.
  4. IPC namespace for managing access to IPC resources.
  5. MNT namespace for managing filesystem mount points.
  6. NET namespace for managing network interfaces.
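The namespaces listed above can also be inspected for a single process through the /proc filesystem, where every entry under /proc/<pid>/ns is a symbolic link such as pid:[4026531836]. The following is a minimal sketch, assuming a Linux host with /proc mounted; the helper name list_namespaces is hypothetical and not part of Docker or util-linux.

```shell
#!/bin/sh
# list_namespaces is a hypothetical helper, not a Docker or util-linux command.
# It prints one "type:[inode]" line per entry in /proc/<pid>/ns for the
# given PID (default: the current shell).
list_namespaces() {
    pid="${1:-$$}"
    for link in /proc/"$pid"/ns/*; do
        # Skip silently if /proc is missing or the PID does not exist.
        [ -L "$link" ] || continue
        # Reading another user's process may be denied; hide that error.
        readlink "$link" 2>/dev/null
    done
}

list_namespaces          # namespaces of the current shell
# list_namespaces 1516   # namespaces of another PID, such as one shown by lsns
```

Two processes share a namespace when the printed identifiers match, so comparing the output for a host shell with the output for a container's main process shows which namespaces were created for that container.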

Cgroups

Control groups (cgroups) are a Linux kernel feature that lets us limit the access that processes and containers have to system resources such as CPU, RAM, IOPS and network. A cgroup restricts an application to a specific set of resources, which allows the Docker engine to share the available hardware among containers and optionally enforce limits and constraints. On an Ubuntu machine, cgroup entries can be found at the following location.

/sys/fs/cgroup/

There are several directories within this directory for various resources such as CPU, Memory and PIDs. We can see the list of directories in the following excerpt.

$ ls /sys/fs/cgroup/
blkio  cpu  cpuacct  cpu,cpuacct  cpuset  devices  freezer  hugetlb  memory  net_cls  net_cls,net_prio  net_prio  perf_event  pids  rdma  systemd  unified

Spinning up docker containers on the host will create an entry in some of these directories for each container with the details of the resources attached. This looks as follows.

$ cat /sys/fs/cgroup/pids/docker/b80255b4f42f0603eff01be1472fe9e561e1ee2f7a584f57fcf4ed30ca5e4156/pids.max
max
$

As the preceding excerpt shows, the pids.max file contains the value max for the container whose ID starts with b80255, meaning that no PID limit is set.

This limit can be set when starting the container. The following example limits the number of PIDs to 6.

$ docker run -itd --pids-limit 6 alpine

Checking the cgroup entry of this new container shows the following.

$ cat /sys/fs/cgroup/pids/docker/abfccb1e25e9dc57698800e53ca277f51c7cc0632734a2cf17a27cd10f14d620/pids.max
6
$

As we can see, the number of PIDs for this container is now limited to 6. Similarly, we can control other resources such as CPU, memory and IOPS.
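The path used above is specific to the cgroup v1 layout shown in this article. As a minimal sketch, the lookup can be wrapped in a small function that tries several layouts. The helper name container_pids_max is hypothetical, and the candidate paths are assumptions that depend on the cgroup version and on the cgroup driver Docker is configured with.

```shell
#!/bin/sh
# container_pids_max is a hypothetical helper, not part of the Docker CLI.
# It prints the pids.max value for a full container ID. The candidate
# paths below are assumptions about common Docker setups.
container_pids_max() {
    id="$1"
    root="${2:-/sys/fs/cgroup}"   # optional second argument: cgroup root
    # Candidates, in order: cgroup v1 (as in this article),
    # cgroup v2 with the cgroupfs driver, cgroup v2 with the systemd driver.
    for file in \
        "$root/pids/docker/$id/pids.max" \
        "$root/docker/$id/pids.max" \
        "$root/system.slice/docker-$id.scope/pids.max"
    do
        if [ -r "$file" ]; then
            cat "$file"
            return 0
        fi
    done
    echo "no pids.max found for $id" >&2
    return 1
}

# Example usage (requires a running Docker host, so left commented out):
# docker run -itd --pids-limit 6 --memory 256m --cpus 0.5 alpine
# container_pids_max <full-container-id>
```

The commented docker run line also sets a memory limit and a CPU limit with the --memory and --cpus flags. Those limits are written to other files in the same per-container cgroup directories.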

Capabilities

The root user in Linux is special and has superpowers: it has more privileges than a normal user. If we break these superpowers into distinct units, they become capabilities. Almost all the special powers associated with the root user are broken down into individual capabilities, which gives us granular control over what root can do.

This means we can make the root user less powerful, and we can also grant additional powers to a standard user at a granular level. By default, Docker follows a whitelist approach: it drops all capabilities except those that containers typically need. We can use Docker commands to add capabilities to, or remove them from, the bounding set. The following command, run inside an alpine container, lists its default bounding set.

/ # capsh --print

If the capsh command is not available, it can be installed using apk add -U libcap.

If a container is started with the --privileged flag, the default bounding set is overridden and all capabilities are assigned to the container.

Docker allows us to drop or add specific capabilities to a container. The following example shows how a specific capability can be dropped.

$ docker run -it --cap-drop CHOWN alpine sh

As we can notice, the capability CHOWN is dropped from the container.

The following example shows how all capabilities can be removed and a specific capability can be added.

$ docker run -it --cap-drop ALL --cap-add CHOWN alpine sh

As we can notice, the capability CHOWN is added and all the other capabilities are dropped from the container.  This is how we can make use of capabilities to have granular control on what privileges the root accounts can have.
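The effect of these flags can also be checked without capsh, because the kernel exposes the bounding set as a hexadecimal bitmask in the CapBnd field of /proc/<pid>/status. The following is a minimal sketch that decodes such a mask. The helper name decode_caps is hypothetical, and the bit order is taken from the kernel's capability numbering, where bit 0 is CAP_CHOWN. The sample mask 00000000a80425fb is the value commonly reported inside a default Docker container; treat it as an example input, since the default set may differ between Docker versions.

```shell
#!/bin/sh
# Capability names in kernel bit order (bit 0 = CAP_CHOWN).
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid \
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw \
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct \
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease \
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm \
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps is a hypothetical helper: it takes a hex mask without the
# 0x prefix (as printed in /proc/<pid>/status) and prints the names of
# the capabilities whose bits are set, separated by commas.
decode_caps() {
    mask=$((0x$1))
    bit=0
    out=""
    for name in $CAP_NAMES; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            out="$out${out:+,}$name"
        fi
        bit=$((bit + 1))
    done
    echo "$out"
}

# Example input: mask commonly seen in a default Docker container.
decode_caps 00000000a80425fb

# Inside a container, the live value can be decoded like this:
# decode_caps "$(awk '/^CapBnd:/ {print $2}' /proc/self/status)"
```

For a container started with --cap-drop ALL --cap-add CHOWN, only bit 0 is set in the bounding set, so the decoded list should contain just chown. Where libcap is installed, capsh --decode=<mask> performs the same decoding.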

Seccomp

Secure computing mode (seccomp) is a Linux kernel feature that we can use to restrict the actions available within a container by filtering which system calls can be made from inside it. When Docker is built with seccomp support and runs on a host kernel that supports it, containers get a default seccomp profile. This profile provides a sane default for running containers and disables around 44 system calls out of 300+. It is documented on the Docker seccomp page listed under Sources.

Many of the syscalls blocked by this profile are also gated by capabilities that Docker drops by default for root users in containers. As a result, even if the seccomp profile is disabled, the dropped capabilities still prevent most of these operations, which provides an extra layer of protection.

When starting a new Docker container, we can override the default seccomp profile with a profile of our choice as shown below.

$ docker run -itd --security-opt seccomp=seccomp-profile.json alpine

Here, we have loaded a custom seccomp profile, which has the following contents.

{
        "defaultAction": "SCMP_ACT_ALLOW",
        "architectures": [
                "SCMP_ARCH_X86_64",
                "SCMP_ARCH_X86",
                "SCMP_ARCH_X32"
        ],
        "syscalls": [
                {
                        "name": "chmod",
                        "action": "SCMP_ACT_ERRNO",
                        "args": []
                }
        ]
}

The preceding profile is a relaxed one: it allows all syscalls from the container by default, except for chmod, which is explicitly blocked.

The following command shows how one can spin up a docker container without loading the default seccomp profile.

$ docker run --rm -it --security-opt seccomp=unconfined alpine sh

We can confirm that we are not running with the default seccomp profile anymore by running an unshare command, which runs a shell process in a new namespace. This looks as follows.

$ docker run --rm -it --security-opt seccomp=unconfined alpine sh
/ # unshare --map-root-user --user
456dee225040:/# whoami
root
456dee225040:/# 

For the record, we cannot run an unshare command when the default seccomp profile is loaded. It looks as follows if attempted. 

$ docker run --rm -it alpine sh
/ # unshare --map-root-user --user
unshare: unshare(0x10000000): Operation not permitted
/ # 
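A more direct check is the Seccomp field in /proc/<pid>/status, which the kernel sets to 0 when seccomp is disabled, 1 for strict mode and 2 when a filter such as a Docker profile is active. The following is a minimal sketch of that check; the helper name seccomp_mode is hypothetical.

```shell
#!/bin/sh
# seccomp_mode is a hypothetical helper. It reads the "Seccomp:" field
# from a status file (default: /proc/self/status) and prints a label for
# the kernel's mode number: 0 = disabled, 1 = strict, 2 = filter.
seccomp_mode() {
    status_file="${1:-/proc/self/status}"
    mode=$(awk '/^Seccomp:/ {print $2}' "$status_file" 2>/dev/null)
    case "$mode" in
        0) echo "disabled" ;;
        1) echo "strict" ;;
        2) echo "filter" ;;
        *) echo "unknown" ;;   # field missing or file not readable
    esac
}

seccomp_mode
```

Run inside a container started with the default profile, this should print filter. With --security-opt seccomp=unconfined it should print disabled.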

Device and file restrictions, AppArmor

AppArmor (Application Armor) is a Linux security module that protects an operating system and its applications from security threats. It was not built specifically for Docker; it is a general Linux security tool, but since Docker relies on the Linux kernel, AppArmor can also be used with Docker containers. AppArmor profiles are defined in terms of file system paths and restrict which files can be accessed. To use AppArmor with Docker, an AppArmor security profile is associated with each container. When a container starts, Docker applies either its default profile or a custom one that we specify, and it expects that profile to be loaded and enforced.

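Docker comes with a default profile for containers named docker-default. The following command applies a different AppArmor profile, here called apparmor-profile, when starting a container. The profile must already be loaded into the kernel, for example with apparmor_parser.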

$ docker run -itd --security-opt apparmor=apparmor-profile alpine
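To confirm which profile a process is actually confined by, we can read /proc/<pid>/attr/current. The following is a minimal sketch, assuming a host where AppArmor is the active security module; on hosts that use a different module, the file may hold that module's label instead. The helper name apparmor_profile_of is hypothetical.

```shell
#!/bin/sh
# apparmor_profile_of is a hypothetical helper. It prints the confinement
# recorded for a PID in /proc/<pid>/attr/current, for example
# "docker-default (enforce)" or "unconfined", and prints "unavailable"
# when the file cannot be read (no such process, or no label exposed).
apparmor_profile_of() {
    pid="${1:-self}"
    # With the default "self", the file is read by the cat child process,
    # which inherits the confinement of the calling shell.
    label=$(cat "/proc/$pid/attr/current" 2>/dev/null | tr -d '\000')
    if [ -n "$label" ]; then
        echo "$label"
    else
        echo "unavailable"
    fi
}

apparmor_profile_of
```

Inside a container running with Docker's default profile, this should report docker-default in enforce mode. A container started with a custom profile should report that profile's name instead.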

Conclusion

This article has provided an overview of how Docker leverages various Linux kernel features, including namespaces, cgroups and capabilities. We discussed how capabilities can be added or removed when spinning up a container. In addition, we discussed how AppArmor and seccomp profiles can be leveraged to improve the overall security of containers.

It should be noted that Docker comes with default AppArmor and seccomp profiles, which can be overridden.

Sources

https://docs.docker.com/engine/security/seccomp/

https://docs.docker.com/engine/security/apparmor/

https://docs.docker.com/engine/security/

Srinivas

Srinivas is an Information Security professional with 4 years of industry experience in Web, Mobile and Infrastructure Penetration Testing. He is currently a security researcher at Infosec Institute Inc. He holds the Offensive Security Certified Professional (OSCP) certification. He blogs at www.androidpentesting.com. Email: srini0x00@gmail.com