chdir to cwd: permission denied

This post describes a breaking change in runc v1.0.0-rc93 and the workaround that has since been implemented and will presumably be included in v1.0.0-rc94. Thanks to @haircommander for talking through the issue with me and implementing the workaround, and to @mattomata for his consultation on the distroless/static:nonroot behavior. If you are not interested in the background of the issue, you can skip this post and take a look at my detailed testing scenarios on the Crossplane repo, or my breakdown of the conflict with the nonroot image on the distroless repo.

Recently, a Crossplane user reported that they had upgraded to the latest version of OpenShift and their Pods immediately started going into CrashLoopBackOff. They went back and verified that the same version of Crossplane was running successfully on the older OpenShift version, but now the containers were exiting with:

level=error msg="container_linux.go:366: starting container process caused: chdir to cwd (\"/home/nonroot\") set in config.json failed: permission denied"

Fortunately, though this error is quite verbose for someone who is not familiar with container runtime internals, it is extremely helpful for a developer, as it provides the file name, the line number, and an explicit description of the action that caused the error. Most Kubernetes platforms these days use containerd as their “high-level” container runtime, and runc as their “low-level” container runtime. I won’t go into too much detail on the responsibilities of each (I recommend you check out this blog series by Ian Lewis), but an extremely simple explanation is that containerd orchestrates containers whose lifecycle is managed by runc. This error has to do with actually starting the container, which is a dead giveaway that we are dealing with runc, but if you weren’t sure, you could confirm it by looking at the source of container_linux.go.

A container is essentially just a Linux process that is sandboxed using namespaces and cgroups. This means that it has the same properties that any other process has, including state, various identifiers, scheduling criteria, and more (if you want to learn more about all of the attributes of a process, taking a look at the task_struct in the Linux Kernel source is a great adventure). The OCI Image Specification allows you to define some of these properties, which are then enforced by the container runtime, in this case runc.

Unfortunately (or fortunately, depending on how you look at it), these attributes can be defined at various levels of the container orchestration stack, notably build time and run time. For example, in a Dockerfile you can define a USER, WORKDIR, and ENTRYPOINT, but you can also override them when running a container using the image you built. Similarly, Kubernetes exposes some of these attributes in the Pod specification through fields such as runAsUser, runAsGroup, and workingDir. This allows for maximum flexibility, but also can lead to a rocky user experience when building widely distributed images or running 3rd party images in your cluster.
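The layering of build-time defaults and run-time overrides can be sketched as a simple merge. This is only an illustration: the ProcessSettings type and the resolve function are hypothetical names, and the values are borrowed from the nonroot image and the error message discussed in this post. Real runtimes resolve many more fields than these two.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProcessSettings:
    """Hypothetical subset of what an image config carries (USER, WORKDIR)."""
    user: str
    working_dir: str


def resolve(image: ProcessSettings,
            run_as_user: Optional[int] = None,
            working_dir: Optional[str] = None) -> ProcessSettings:
    """Return the settings the runtime would see: overrides win, image defaults fill gaps."""
    return ProcessSettings(
        user=str(run_as_user) if run_as_user is not None else image.user,
        working_dir=working_dir if working_dir is not None else image.working_dir,
    )


# Build-time defaults, as a Dockerfile USER / WORKDIR pair would set them.
image = ProcessSettings(user="65532", working_dir="/home/nonroot")

# No overrides: the image defaults are used as-is.
print(resolve(image))

# Overriding only the user (as runAsUser would) leaves the working directory untouched.
print(resolve(image, run_as_user=1000620000))
```

Note that overriding one field does not adjust the other: the user can change while the working directory still points at a path that was prepared for the original user.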

So back to the original issue: the parameters used to run the Crossplane containerized processes had not been changed, but runc was now failing to start them. Once again looking at the error message, we can see that runc is unable to chdir to the cwd (or WORKDIR in Dockerfile parlance). When a container is started, runc must identify the cwd that is specified in the image manifest, or overridden by the container orchestrator, and switch to that directory before actually starting the process. Since this step was now failing, there must have been a change in how runc executes it.

Taking a look at this issue in the Red Hat bug tracker, we can see that a recent change in runc switched from executing the chdir step with the UID of runc to the UID specified by the container image (or overridden by the container orchestrator). But why was this change primarily causing problems for images built on the distroless/static:nonroot base image? To understand, we must first take a look at how the base image is built.

The purpose of the distroless project is to provide base images that have the minimum components required for an application to run. Many common base images, such as alpine, are stripped of much of the cruft of a full Linux distribution, but they still contain many tools and utilities that are unneeded and expose unnecessary attack vectors if a container were compromised. So distroless just packages the bare essentials, which for a Go application are packages such as ca-certificates and tzdata. The distroless project also offers a nonroot variant of this bare-bones image, which is essentially the same as static, but sets the user to nonroot (UID = 65532), and the working directory to /home/nonroot.

Note: these images are built with Bazel, but you can get a good idea of how the .bzl files map to a Dockerfile without requiring a deep understanding of the build system just by looking at the docker rules.

The /home/nonroot directory gets created with 0700 permissions, which means that the owning user (nonroot) has Read, Write, and Execute permissions on the directory, but other users, even those in the same group, have no permissions. In order to chdir, the calling process must have Execute permissions on the target directory. In this case only UID 65532 has those permissions, but the nonroot image sets the user for us, so that shouldn’t be an issue, right?

As mentioned before, the build-time properties are sometimes overridden at run time, and on OpenShift this always happens unless you explicitly opt out. When a Namespace is created on an OpenShift cluster, it is given a range of UIDs and GIDs, and each Pod that gets deployed in the Namespace is assigned the first UID and GID in the range. However, this is not new functionality in OpenShift, so if the 65532 UID was already being overridden in older versions, why were the containers able to be started successfully?
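To make the UID assignment concrete, here is a small sketch of picking the first UID out of a Namespace's range. It assumes the range is expressed as a "start/size" string, which is how I understand the openshift.io/sa.scc.uid-range Namespace annotation to be formatted; the sample value and the function names are hypothetical.

```python
def parse_uid_range(value: str) -> range:
    """Parse a "start/size" string into the range of UIDs it covers."""
    start_text, sep, size_text = value.partition("/")
    if not sep:
        raise ValueError(f"expected 'start/size', got {value!r}")
    start, size = int(start_text), int(size_text)
    return range(start, start + size)


def default_uid(value: str) -> int:
    """Return the first UID in the range, which is what a Pod would be assigned."""
    return parse_uid_range(value)[0]


# Hypothetical annotation value for a Namespace.
uid_range = "1000620000/10000"
print(default_uid(uid_range))           # the UID the container process runs as
print(65532 in parse_uid_range(uid_range))  # whether the image's nonroot UID falls in the range
```

The point of the second check is that the UID baked into the nonroot image is nowhere near the assigned range, so the container process ends up running as a user that does not own /home/nonroot.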

The reason lies in the fact that the newer version of OpenShift had upgraded to runc v1.0.0-rc93, which included the aforementioned change to chdir with the container user rather than the user for the runc process itself. Unless running in rootless mode, runc runs as the root user, which typically holds privileged capabilities, such as CAP_DAC_OVERRIDE, that allow it to bypass file permission checks. So though the older OpenShift version was still using a “random” UID, since the underlying runc version was executing chdir as root, it was not resulting in a permission denied error.
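The difference between the two behaviors can be modeled with a simplified version of the kernel's directory search check. This is a sketch, not runc's code: can_search is a hypothetical helper, it ignores ACLs and other security modules, and the directory's group and the Pod's GID are assumed values. The bypass flag stands in for a privileged process holding a capability such as CAP_DAC_OVERRIDE.

```python
import stat


def can_search(mode: int, owner_uid: int, owner_gid: int,
               uid: int, gids: set, bypass_dac: bool = False) -> bool:
    """Simplified check of whether a process may chdir into a directory."""
    if bypass_dac:
        # Privileged processes skip the permission bits entirely.
        return True
    if uid == owner_uid:
        return bool(mode & stat.S_IXUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IXGRP)
    return bool(mode & stat.S_IXOTH)


# /home/nonroot as described above: 0700, owned by UID 65532 (group is assumed).
home = dict(mode=0o700, owner_uid=65532, owner_gid=65532)

# The image's own user can enter its home directory.
print(can_search(**home, uid=65532, gids={65532}))

# Older behavior: chdir performed by runc as privileged root.
print(can_search(**home, uid=0, gids={0}, bypass_dac=True))

# rc93 behavior: chdir performed as the overridden container user.
print(can_search(**home, uid=1000620000, gids={1000620000}))
```

Only the last case fails, which matches the permission denied error: the directory never changed, only the identity performing the chdir did.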

Whether this change is a feature or a bug is mostly a matter of how one interprets the responsibilities of a container runtime. However, regardless of whether it is “correct” or not, it was a breaking change to a rather important component of the container ecosystem, which has led to a workaround being introduced that will presumably be included in v1.0.0-rc94. As for current users of the nonroot base image, though I have detailed the situation in an issue on the distroless repo, I would not expect (nor recommend) that a change be made to the image configuration. The purpose of the image is to enforce the use of the single nonroot user, so making it possible to run as a different user would defeat that intention. Instead, if you agree with Red Hat’s assertion (see the “Traditional Applications and UIDs” section) that an application should not “expect a specific UID under which they will be running”, a viable alternative is to use the plain distroless/static image and manually set UID = 65532 in your image build, effectively saying that “you will run as nonroot by default, but you are free to override if you know what you’re doing”. This is the direction we have gone in the most recent Crossplane release.

Closing Thoughts

Hopefully this post serves as a helpful guide for other folks who encounter this issue. I want to give a huge shout out to all the folks who work on projects like runc and containerd, which are the foundation for much of the infrastructure of the cloud native landscape. When there are changes that break existing behavior, I encourage folks to respond with grace and constructive feedback. Oftentimes, as is true with this change, the new behavior is in the interest of protecting users, even if it can cause short-term headaches.

Send me a message @hasheddan on Twitter for any questions or comments!