This post describes a breaking change in runc v1.0.0-rc93, which has subsequently had a workaround implemented that will presumably be included in v1.0.0-rc94. Thanks to @haircommander for talking through the issue with me and implementing the subsequent workaround, and to @mattomata for his consultation on the distroless/static:nonroot behavior. If you are not interested in the background of the issue, you can skip reading this post and take a look at my detailed testing scenarios on the Crossplane repo, or my breakdown of the conflict with the nonroot image on the distroless repo.
Recently, a Crossplane user reported that they had upgraded to the latest version of OpenShift and their Pods immediately started going into CrashLoopBackOff. They went back and verified that the same version of Crossplane was running successfully on the older OpenShift version, but now the containers were exiting with:
level=error msg="container_linux.go:366: starting container process caused: chdir to cwd (\"/home/nonroot\") set in config.json failed: permission denied"
Fortunately, though this error is quite verbose for someone who is not familiar with container runtime internals, it is extremely helpful for a developer as it provides the file name, line number, and explicit expression of the action that caused the error. Most Kubernetes platforms these days use containerd as their “high-level” container runtime, and runc as their “low-level” container runtime. I won’t go into too much detail on the responsibilities of each (I recommend you check out this blog series by Ian Lewis), but an extremely simple explanation is that containerd orchestrates containers whose lifecycle is managed by runc. This error has to do with actually starting the container, which is a dead giveaway that we are dealing with runc, but if you weren’t sure it could be confirmed by looking at the source of container_linux.go.
A container is essentially just a Linux process that is sandboxed using namespaces and cgroups. This means that it has the same properties that any other process has, including state, various identifiers, scheduling criteria, and more (if you want to learn more about all of the attributes of a process, taking a look at the task_struct in the Linux kernel source is a great adventure). The OCI Image Specification allows you to define some of these properties, which are then enforced by the container runtime, in this case runc.
Unfortunately (or fortunately, depending on how you look at it), these attributes can be defined at various levels of the container orchestration stack, notably build time and run time. For example, in a Dockerfile you can define a USER, WORKDIR, and ENTRYPOINT, but you can also override them when running a container using the image you built. Similarly, Kubernetes exposes some of these attributes in the Pod specification through fields such as runAsUser, runAsGroup, and workingDir. This allows for maximum flexibility, but also can lead to a rocky user experience when building widely distributed images or running 3rd party images in your cluster.
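To make the build time versus run time layering concrete, here is a minimal Python sketch of how an effective process configuration could be resolved. It is only an illustration of the precedence described above, not how any runtime actually implements it; the ProcessConfig type, the apply_overrides helper, and the /app entrypoint are all hypothetical names I made up for the example.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class ProcessConfig:
    """Process attributes baked into an image at build time (hypothetical type)."""
    user: str
    working_dir: str
    entrypoint: Tuple[str, ...]


def apply_overrides(
    image: ProcessConfig,
    user: Optional[str] = None,
    working_dir: Optional[str] = None,
    entrypoint: Optional[Tuple[str, ...]] = None,
) -> ProcessConfig:
    """Return the effective config: run time values win, image values are the fallback."""
    requested = {"user": user, "working_dir": working_dir, "entrypoint": entrypoint}
    # Only fields that were explicitly set at run time replace the image defaults.
    overrides = {key: value for key, value in requested.items() if value is not None}
    return replace(image, **overrides)


# Image defaults, as a Dockerfile with USER / WORKDIR / ENTRYPOINT might define them.
image = ProcessConfig(user="65532", working_dir="/home/nonroot", entrypoint=("/app",))

# An orchestrator overriding only the user (comparable to setting runAsUser).
effective = apply_overrides(image, user="1000650000")
```

Note that overriding only the user leaves the working directory from the image untouched, which is exactly the kind of partial override that matters later in this post.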
So back to the original issue: the parameters used to run the Crossplane containerized processes had not been changed, but runc was now failing to start them. Once again looking at the error message, we can see that runc is unable to chdir to the cwd (or WORKDIR in Dockerfile parlance). When a container is started, runc must identify the cwd that is specified in the image manifest, or overridden by the container orchestrator, and switch to that directory before actually starting the process. Since this step was now failing, there must have been a change in how runc executes it.
Taking a look at this issue in the Red Hat bug tracker, we can see that a recent change in runc switched from executing the chdir step with the UID of runc to the UID specified by the container image (or overridden by the container orchestrator). But why was this change primarily causing problems for images built on the distroless/static:nonroot base image? To understand, we must first take a look at how the base image is built.
The purpose of the distroless project is to provide base images that have the minimum components required for an application to run. Many common base images, such as alpine, are stripped of much of the cruft of a full Linux distribution, but there are still many tools and utilities that are unneeded and expose unnecessary attack vectors if a container was compromised. So distroless just packages the bare essentials, which for a Go application are packages such as ca-certificates and tzdata.
The distroless project also offers a nonroot variant of this bare bones image, which is essentially the same as static, but sets the user to nonroot (UID = 65532), and the working directory to /home/nonroot.
Note: these images are built with Bazel, but you can get a good idea of how the .bzl files map to a Dockerfile without requiring a deep understanding of the build system just by looking at the docker rules.
The /home/nonroot directory gets created with 0700 permissions, which means that the owning user (nonroot) has Read, Write, and Execute permissions on the directory, but other users, even those in the same group, have no permissions. In order to chdir, the calling process must have Execute permissions on the target directory. In this case only UID 65532 has those permissions, but the nonroot image sets the user for us so that shouldn’t be an issue, right?
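The permission reasoning above can be checked with a small Python sketch that decodes the mode bits. It models only the classic owner/group/other check and deliberately ignores capabilities and ACLs. The may_chdir helper is a hypothetical name, and the directory's group ID (65532) and the alternate UIDs are assumptions for illustration only.

```python
import stat
from typing import Set


def may_chdir(mode: int, dir_uid: int, dir_gid: int, uid: int, gids: Set[int]) -> bool:
    """Classic Unix check for the execute (search) bit on a directory.

    Exactly one permission class applies: owner first, then group, then other.
    Capabilities and ACLs are intentionally left out of this sketch.
    """
    if uid == dir_uid:
        return bool(mode & stat.S_IXUSR)
    if dir_gid in gids:
        return bool(mode & stat.S_IXGRP)
    return bool(mode & stat.S_IXOTH)


HOME_MODE = 0o700      # permissions of /home/nonroot
NONROOT_UID = 65532    # the nonroot user
NONROOT_GID = 65532    # assumed group owner of the directory

# Human-readable form of the directory's mode.
listing = stat.filemode(stat.S_IFDIR | HOME_MODE)

# The owner can enter the directory.
owner_ok = may_chdir(HOME_MODE, NONROOT_UID, NONROOT_GID, NONROOT_UID, {NONROOT_GID})

# A different UID cannot, even when it shares the directory's group.
same_group_ok = may_chdir(HOME_MODE, NONROOT_UID, NONROOT_GID, 1000, {NONROOT_GID})

# An unrelated UID and group cannot either.
stranger_ok = may_chdir(HOME_MODE, NONROOT_UID, NONROOT_GID, 1000650000, {0})
```

With mode 0700 only the owner branch has the execute bit set, so any process that is not running as UID 65532 (and has no capability that bypasses the check) is denied.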
As mentioned before, the build time properties are sometimes overridden at run time, and on OpenShift, this always happens unless explicitly overridden. When a Namespace is created on an OpenShift cluster, it is given a range of UIDs and GIDs, and each Pod that gets deployed in the Namespace is assigned the first UID and GID in the range. However, this is not new functionality in OpenShift, so if the 65532 UID was already being overridden in older versions, why were the containers able to be started successfully?
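As a rough model of the assignment described above, the following Python sketch picks the UID a Pod would end up with. It is not OpenShift's implementation: the assigned_uid helper and the concrete range values are hypothetical, and the range is represented as a plain Python range rather than whatever format the platform uses internally.

```python
from typing import Optional


def assigned_uid(namespace_uids: range, run_as_user: Optional[int] = None) -> int:
    """Return the UID a Pod runs as under the behavior described in this post.

    If the Pod explicitly sets a user, that value is used; otherwise the
    first UID in the Namespace's range is assigned.
    """
    if run_as_user is not None:
        return run_as_user
    if len(namespace_uids) == 0:
        raise ValueError("namespace UID range is empty")
    return namespace_uids[0]


# Hypothetical range allocated to a Namespace: 10000 UIDs starting at 1000650000.
namespace_uids = range(1000650000, 1000650000 + 10000)

default_uid = assigned_uid(namespace_uids)
explicit_uid = assigned_uid(namespace_uids, run_as_user=65532)
```

The important takeaway is that, by default, the resulting UID has nothing to do with the 65532 baked into the image.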
The reason lies in the fact that the newer version of OpenShift had upgraded to runc v1.0.0-rc93, which included the aforementioned change to chdir with the container user rather than the user for the runc process itself. Unless running in rootless mode, runc runs as the root user, which typically has special privileged capabilities, such as CAP_DAC_OVERRIDE, which allow it to bypass file permission checks. So though the older OpenShift version was still using a “random” UID, since the underlying runc version was executing chdir as root, it was not resulting in a permission denied error.
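The difference between the two behaviors can be simulated in a few lines of Python. This is a toy model under stated assumptions, not runc's actual code: the Directory type and the can_search and start_container helpers are hypothetical, root is assumed to hold its usual capabilities and therefore bypass the permission bits, and group membership is ignored (which is harmless here because 0700 grants nothing to group or other).

```python
import stat
from dataclasses import dataclass

ROOT_UID = 0


@dataclass(frozen=True)
class Directory:
    """A directory reduced to the attributes relevant for this sketch."""
    path: str
    owner_uid: int
    mode: int


def can_search(directory: Directory, uid: int) -> bool:
    """Simplified check: root bypasses permission bits, others need the execute bit."""
    if uid == ROOT_UID:
        return True
    if uid == directory.owner_uid:
        return bool(directory.mode & stat.S_IXUSR)
    return bool(directory.mode & stat.S_IXOTH)


def start_container(cwd: Directory, container_uid: int, chdir_as_container_user: bool) -> str:
    """Model the chdir step being performed either as root or as the container user."""
    acting_uid = container_uid if chdir_as_container_user else ROOT_UID
    if not can_search(cwd, acting_uid):
        raise PermissionError(f'chdir to cwd ("{cwd.path}") failed: permission denied')
    return f"started as UID {container_uid} in {cwd.path}"


home = Directory(path="/home/nonroot", owner_uid=65532, mode=0o700)
random_uid = 1000650000  # hypothetical UID assigned by the platform

# Older behavior: chdir happens as root, so the "random" UID is not a problem.
old_result = start_container(home, random_uid, chdir_as_container_user=False)

# Newer behavior: chdir happens as the container user, which fails for a non-owner.
try:
    new_result = start_container(home, random_uid, chdir_as_container_user=True)
except PermissionError as err:
    new_result = str(err)
```

Running the same model with the container user set to 65532 succeeds in both modes, which matches why the issue only surfaces when the image's user is overridden.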
Whether this change is a feature or a bug is mostly a matter of how one interprets the responsibilities of a container runtime. However, regardless of it being “correct” or not, it was a breaking change to a rather important component of the container ecosystem, which has led to a subsequent workaround being introduced that will presumably be included in v1.0.0-rc94. As for current users of the nonroot base image, though I have detailed the situation in an issue on the distroless repo, I would not expect (nor recommend) that a change be made to the image configuration. The purpose of the image is to enforce the use of the single nonroot user, so making it possible not to do so would break the original intention. Instead, if you agree with Red Hat’s assertion (see the “Traditional Applications and UIDs” section) that an application should not “expect a specific UID under which they will be running”, a viable alternative is to use the plain distroless/static image and manually set UID = 65532 in your image build, effectively saying that “you will run as nonroot by default, but you are free to override if you know what you’re doing”. This is the direction we have gone in the most recent Crossplane release.
Closing Thoughts
Hopefully this post serves as a helpful guide for other folks that encounter this issue. I want to give a huge shout out to all the folks who work on projects like runc and containerd, which are the foundation for much of the infrastructure of the cloud native landscape. When there are changes that break existing behavior, I encourage folks to respond with grace and constructive feedback. Oftentimes, as is true with this change, the new behavior is in the interest of protecting users, even if it can cause short term headaches.
Send me a message @hasheddan on Twitter for any questions or comments!