LXC & Docker Containerization

Setting Up an Unprivileged LXC Container with AppArmor on Ubuntu 22.04

The Problem: I was beyond frustrated. Every time I started an LXC container as root, I was mindful that one buggy program inside it could compromise the whole host. The anxiety of losing the entire server to a container escape plagued me.

The Constraints: I wanted to use containers for testing, but without granting them CAP_SYS_ADMIN. My host was a hardened Ubuntu 22.04 server, and I needed AppArmor to provide real protection rather than just another checkmark on compliance paperwork.

The Solution: User namespace mappings are what make unprivileged containers possible. Once I understood that the container's root (UID 0) is just an unprivileged UID from a very high range on the host, the fear of a container escape became manageable. A custom AppArmor profile then let me confine everything running inside the container.

Quick Summary:

  • Create an unprivileged LXC container as an ordinary user.
  • Map the container’s root UID (0) to a very high UID on the host, so that an escaping root process holds only the privileges of that unprivileged host user.
  • Apply and enforce a strict, LXC auto-generated AppArmor profile on each container.
  • Set cgroups v2 resource limits without a daemon running as root.
  • Debug a configuration error in which a deprecated config key silently disables confinement.

Tested OS & Environment: Ubuntu 22.04.3 LTS with kernel version 5.15.0-91-generic, LXC version 5.0.1, AppArmor version 3.0.4. All steps below were performed on this setup.

Prerequisites & Planning

System Readiness: Kernel and User Namespace Support for Containers

The Linux kernel provides user namespaces; before their introduction, containers could only run privileged. Two things must be in place: the kernel must support user namespaces, and cgroups v2 must be mounted. Let’s verify both.

In order to verify that your kernel supports user namespaces, you should run the following command:

unshare --user --pid echo "User namespaces work"

If the command prints the message without any permission errors, you are ready to continue. Next, verify that cgroups v2 is mounted at /sys/fs/cgroup. For background on the single-hierarchy layout we rely on, see the kernel cgroup v2 documentation.

mount | grep cgroup2

You should see an entry similar to cgroup2 on /sys/fs/cgroup type cgroup2. If you do not, your kernel may have been booted with the parameter systemd.unified_cgroup_hierarchy=0; avoid that. Now install the necessary packages (via sudo, since these are system packages).

sudo apt update && sudo apt install -y lxc apparmor-profiles uidmap

The uidmap package provides you with both newuidmap and newgidmap, which are used by LXC to create user namespaces for unprivileged containers.

Verification: Run lxc-checkconfig as your normal user and confirm the line User namespace: enabled. If it reports missing, check the kernel configuration for CONFIG_USER_NS=y.
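The two checks above can be wrapped into a small preflight script. This is a convenience sketch (the message strings are my own; only the unshare command and the cgroup2 mount point come from the steps above):

```shell
# Preflight for unprivileged LXC: user namespaces and cgroups v2.
preflight() {
  # User namespaces: the same unshare test as above, silenced.
  if unshare --user --pid echo ok >/dev/null 2>&1; then
    echo "userns: ok"
  else
    echo "userns: NOT available (check CONFIG_USER_NS=y)"
  fi
  # cgroups v2 exposes cgroup.controllers at its mount point.
  if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
    echo "cgroup2: mounted"
  else
    echo "cgroup2: missing (booted with systemd.unified_cgroup_hierarchy=0?)"
  fi
}

preflight
```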

Mapping UID and GID Ranges for Unprivileged Containers

An unprivileged container still runs as root (UID 0) inside its own namespace, so that root must be mapped to a non-root UID on the host. The mapping is defined in /etc/subuid and /etc/subgid.

Before modifying either of these configuration files, you should create backup copies of the existing files. To do this, run the following commands: sudo cp /etc/subuid /etc/subuid.bak and sudo cp /etc/subgid /etc/subgid.bak.

While editing these files, keep a second root shell open so that a mistake cannot lock you out.

Add a line for your user that allocates a range of 65,536 UIDs starting at 1,000,000. The allocation may look large, but it causes no problems: 65,536 IDs gives the container a full 16-bit UID space.

myuser:1000000:65536

Repeat the same entry in /etc/subgid. LXC uses the lxc.idmap setting to map container UID 0 to host UID 1,000,000 and to continue that pattern up through the container’s UID range. Anyone who breaks out of the container as root therefore does not hold root privileges on the host; they are treated as just another user with no special privileges.
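In the container’s config, that mapping is expressed with two lxc.idmap lines (u for UIDs, g for GIDs), mirroring the /etc/subuid and /etc/subgid entries; they are commonly placed in ~/.config/lxc/default.conf so new containers inherit them:

```
lxc.idmap = u 0 1000000 65536
lxc.idmap = g 0 1000000 65536
```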

Checking your user’s mapping: Since we are running as normal users, use grep $USER /etc/subuid to confirm that the correct record is present. On the host, cat /proc/self/uid_map shows the identity mapping 0 0 4294967295; once the container is running, checking uid_map for its processes will instead show the shifted 65,536-UID range.
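To make the translation concrete, here is a small illustrative Python helper (not part of the setup) that applies a uid_map entry of the form "inside-start host-start count" the same way the kernel does:

```python
def map_uid(container_uid, inside_start, host_start, count):
    """Translate a container UID to a host UID using one uid_map entry.

    A uid_map line 'inside_start host_start count' means container UIDs
    [inside_start, inside_start + count) appear on the host as
    [host_start, host_start + count).
    """
    if not (inside_start <= container_uid < inside_start + count):
        raise ValueError("UID not covered by this mapping")
    return host_start + (container_uid - inside_start)

# Container root (0) under the mapping '0 1000000 65536':
print(map_uid(0, 0, 1_000_000, 65_536))   # -> 1000000
# A service user such as UID 33 inside the container:
print(map_uid(33, 0, 1_000_000, 65_536))  # -> 1000033
```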

Core Setup / Architecture Breakdown

Creating an Unprivileged LXC Container with AppArmor Profile on Ubuntu

We will create the container as our normal user using the download template. Note that the -- separates LXC’s own arguments from those passed to the download template.

lxc-create -t download -n secure-01 -- -d ubuntu -r jammy

The new container’s configuration file is at ~/.local/share/lxc/secure-01/config. Open it in a text editor and find the lxc.apparmor.profile line; set it to generated. This tells LXC we want a custom, auto-generated AppArmor profile rather than the default. According to the lxc.container.conf(5) man page, lxc.apparmor.profile = generated triggers LXC to build a restrictive profile and automatically load it when the container starts.

Add this line near the top of the config file:

lxc.apparmor.profile = generated

Now we can start the container in the foreground, capturing the log to see if the profile is loaded.

lxc-start -n secure-01 -F --logfile /tmp/lxc-secure.log 2>&1 | head

Search the output for a line mentioning lxc-secure-01_, the prefix of the automatically generated AppArmor profile name.

lxc-start secure-01 20240508120000.123 INFO conf - conf.c:run_script_argv:372 - Executing script ... lxc-secure-01_

Enabling and Verifying AppArmor Enforcement

Once the AppArmor profile has been generated, we can load it manually to make sure enforcement is in place. The profile file takes its name from the container’s profile name, which includes a hash that changes for each container instance. The profiles are stored under ~/.local/share/lxc/secure-01/apparmor/.

sudo apparmor_parser -r ~/.local/share/lxc/secure-01/apparmor/lxc-secure-01_*

After creating and starting the new container, you can check the AppArmor status of the container.

lxc-start -n secure-01
sudo aa-status | grep lxc-secure-01

The output should be similar to:

  lxc-secure-01_<hash> (enforce)

If you had started the container in complain mode, the status would read (complain); in enforce mode, any access not permitted by the profile is denied. Beyond the AppArmor status, the startup log records exactly which profile was applied, so you can validate that it is the one you expect.

Validating UID Mapping and Process Isolation

With the container running, check the UID mapping of its PID 1 process from the host:

lxc-attach -n secure-01 -- cat /proc/1/uid_map

The output should start with 0 1000000 65536: container UID 0 corresponds to host UID 1,000,000, a gap of one million between the container’s root and the host’s real root. Running ps auxZ | grep init on the host confirms that the container’s init process runs as UID 1,000,000, not as real root.

         0 1000000 65536     <-- container root = host UID 1000000

Validating: Running ls -l /proc/$(pgrep -f "lxc.init"|head -1)/ns/user from the host shows that the container’s process exists in a separate user namespace.

Optimization / Best Practices

Customizing AppArmor Profiles for Application-Specific Security

The auto-generated AppArmor profile is a good starting point, but real applications usually need it extended. You can edit the profile at ~/.local/share/lxc/secure-01/apparmor/lxc-secure-01_* without interfering with LXC. If your web server runs inside the container on port 8080 and serves files from /var/www, you will also need to include:

  network inet stream,
  /var/www/ r,
  /var/www/** rwk,

Avoid the broad capability sys_admin rule; it defeats the purpose of using AppArmor in the first place. Also, if you later bind-mount a host directory into the container, the rules will need to match (covered in the next section). After making changes, reload the profile with apparmor_parser -r and restart the container for them to take effect.

Enabling cgroups v2 Resource Limits for Unprivileged Containers

With cgroups v2 mounted, resource limits can be set for an unprivileged container without going through a privileged daemon. LXC exposes this capability through the lxc.cgroup2.* keys. A full list of the available controllers is in the kernel cgroup v2 documentation.

To cap the container’s resources, add the following to its configuration file:

lxc.cgroup2.memory.max = 512M
lxc.cgroup2.cpu.max = 200000 100000  # 2 CPUs worth of time in a 100ms period

You can verify that your settings are enforced using systemd-cgtop (if systemd is running inside the container) or directly by reading the limit files under /sys/fs/cgroup from inside the container:

cat /sys/fs/cgroup/memory.max

Inside the container, this prints 536870912 (512 MiB): the maximum memory the kernel will let the container allocate, with no bursting above it. This is the cleanest way to ensure that your host never gets starved by a noisy neighbour.
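A quick way to sanity-check those numbers (illustrative Python, not part of the container setup): cpu.max is "quota period" in microseconds, and 512M expands to 512 × 1024 × 1024 bytes.

```python
def effective_cpus(quota_us, period_us):
    """cpu.max = 'quota period': the container may consume quota_us of
    CPU time in every period_us window, i.e. quota/period CPUs' worth."""
    return quota_us / period_us

def parse_mem(value):
    """Expand a suffixed size like '512M' into bytes (binary units,
    as the kernel interprets cgroup memory values)."""
    units = {"K": 1024, "M": 1024**2, "G": 1024**3}
    if value[-1] in units:
        return int(value[:-1]) * units[value[-1]]
    return int(value)

print(effective_cpus(200000, 100000))  # -> 2.0 (two CPUs' worth of time)
print(parse_mem("512M"))               # -> 536870912
```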

Verification: Use “systemd-cgtop” inside your container to verify that the memory and CPU usage is being accounted for correctly.

Debugging Real‑World Setup Issues

What Didn’t Work For Me

Here is one of those “why did this happen to me?” problems that cost me an entire afternoon. I created a new container, set lxc.apparmor.profile = generated, and started it. Checking with aa-status, the profile was listed in complain mode even though I never asked for that mode. AppArmor was running, but it was only logging activity rather than blocking anything it would normally block. Reviewing the startup logs, I found this:

lxc-start secure-01 ... WARN conf - conf.c:apparmor_load:366 - Invalid profile 'lxc-default' ... falling back to complain mode

Why was I seeing lxc-default? I had never referenced it anywhere. Digging through the container’s config file, I found the source:

lxc.apparmor.profile = generated
lxc.aa_profile = lxc-default     <-- REMOVE THIS DEPRECATED LINE

The lxc.aa_profile line was a leftover from an earlier AppArmor test. With both keys present, LXC applied the legacy value and discarded the one I actually wanted. Because the referenced profile did not exist, LXC fell back to complain mode: the safest way to keep the container running, but not the confinement I intended. Running lxc-checkconfig confirmed this, warning that lxc.aa_profile was deprecated and misconfigured. The remedy was simple: delete the line, re-apply the AppArmor profile with sudo apparmor_parser -r ~/.local/share/lxc/secure-01/apparmor/lxc-secure-01_*, then restart the container. aa-status now showed (enforce). The moral of the story: never specify both lxc.aa_profile and lxc.apparmor.profile.
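A quick way to catch this class of mistake is to grep the config for both the legacy and the current key before starting the container. A sketch (the sample config below is fabricated for illustration):

```shell
# Write a sample config containing both keys (illustrative only).
cat > /tmp/sample-lxc-config <<'EOF'
lxc.apparmor.profile = generated
lxc.aa_profile = lxc-default
EOF

# A count of 2 means both keys are present at once: the misconfiguration.
grep -cE '^lxc\.(aa_profile|apparmor\.profile)' /tmp/sample-lxc-config  # -> 2
```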

Edge Cases and Post‑Setup Pitfalls

Resolving Denied Mounts due to Overlapping AppArmor Rules

When you bind-mount directories from the host filesystem, AppArmor may deny the mount call even though the filesystem permissions themselves are fine. The automatically generated profile does not include mount rules for every case. If you are bind-mounting on ext4, you will need an explicit rule. The Arch Wiki LXC page has a section on bind-mount failures that helped me considerably.

Insert into your AppArmor profile:

mount fstype=ext4 options=(rw, bind) /host/path/ -> /container/path/,

Remember that the host path is what backs the mount point inside your container. After editing, re-apply the profile with sudo apparmor_parser -r and stop/start the container. If you still see denials, check dmesg for apparmor="DENIED" entries to determine exactly which operations were blocked.
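Denial lines have a regular key="value" format, so you can pull out the operation and the denied path directly. A sketch using a fabricated sample line (the real input comes from sudo dmesg):

```shell
# Fabricated example of a kernel AppArmor denial (real ones come from dmesg).
line='audit: type=1400 audit(1715170000.000:42): apparmor="DENIED" operation="mount" profile="lxc-secure-01_x" name="/container/path/"'

# Extract the operation and the denied path from the audit record.
echo "$line" | grep -o 'operation="[^"]*"'  # -> operation="mount"
echo "$line" | grep -o 'name="[^"]*"'       # -> name="/container/path/"
```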

Frequently Asked Questions

Why does my unprivileged container still have access to host systemctl or systemd services?

It shouldn’t, by default. systemd running inside an unprivileged container lives in its own PID and user namespaces and cannot talk to the host’s PID 1, and the host’s D-Bus socket that systemctl uses is not visible inside the container. This isolation is what protects you from accidentally restarting host services from inside the container.

How can I migrate an existing privileged container to unprivileged without data loss?

First stop the container, then either use lxc-usernsexec or shift the rootfs ownership manually, e.g. chown -R 1000000:1000000 /var/lib/lxc/OldName/rootfs. The safest route, however, is to create a fresh unprivileged container, rsync the original rootfs into it, and adapt the config from the privileged one. The LXC upstream documentation covers UID/GID shifting in full, and there are plenty of ways to lose file capabilities along the way, so take a full backup first.
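The ID shift behind a proper per-file migration is pure arithmetic: every container-internal ID moves up by the subordinate base. An illustrative Python helper (the function name is my own; the base of 1,000,000 matches the mapping used throughout this guide):

```python
def shift_id(old_id, base=1_000_000, range_size=65_536):
    """Shift a container-internal UID/GID to its host-side value.

    This is the per-file translation a correct migration performs:
    container ID n becomes host ID base + n, for n within the mapped
    range. (A blanket chown -R to a single ID, by contrast, flattens
    all ownership information.)
    """
    if not 0 <= old_id < range_size:
        raise ValueError("ID outside the mapped range")
    return base + old_id

print(shift_id(0))   # container root -> 1000000
print(shift_id(33))  # e.g. www-data  -> 1000033
```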

What are the AppArmor profile differences between lxc-default and lxc-default-with-nesting, and when should I use which?

lxc-default is designed for normal, single-layer containers. lxc-default-with-nesting relaxes various rules around mount, pivot_root, and namespaces so that you can run LXC inside LXC. Use that exception only when you specifically need nested containers, and always pair it with unprivileged containers to minimize the blast radius. For most cases, generated is a far more targeted choice.

 
