go-criu v8.3.0 switches to protobuf-go-lite, which helps to remove
google.golang.org/protobuf dependency from here, reducing the runc
binary size from ~16M to ~14M.
The only missing piece is proto.String, proto.Bool, proto.Int32 etc.
helpers that return a pointer to a given variable. Those are replaced
by a generic mkPtr, which in turn is to be replaced by the new builtin
once Go < 1.26 is no longer supported.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As with /dev symlinks, this was missed in commit d40b3439a9 ("rootfs:
switch to fd-based handling of mountpoint targets"). It's not really
clear to what extent this was exploitable (/sys/fs/cgroup is a tmpfs we
create) but it's better to just fix this anyway.
Fixes: d40b3439a9 ("rootfs: switch to fd-based handling of mountpoint targets")
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
These codepaths are very old and operate on pure paths but before
pivot_root(2), meaning that a bad image with a malicious /dev symlink
could cause us to operate on host paths instead.
In practice this means that we could be tricked into removing a file
called "ptmx" (note that /dev/pts/ptmx and /dev/ptmx are both immune for
different reasons) or creating a very restricted set of symlinks (with
fixed targets and names). The scope of these bugs is thus quite limited,
but we definitely need to harden against it.
These codepaths were unfortunately missed during the fd-based rework in
commit d40b3439a9 ("rootfs: switch to fd-based handling of mountpoint
targets") -- I must've assumed they were called after pivot_root(2)...
Fixes: GHSA-xjvp-4fhw-gc47
Fixes: CVE-2026-41579
Fixes: d40b3439a9 ("rootfs: switch to fd-based handling of mountpoint targets")
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
This argument order matches most other helpers we have and will also
match the changes we are about to make to setupPtmx and
setupDevSymlinks.
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
Replace the goroutine + channel + 100ms time.After + blocking open
in handleFifo with a poll(2) loop on a non-blocking open. Use
pidfd_open(2) where available to wait for init exit without timeout,
falling back to /proc state checks with 100ms timeout on older
kernels.
Fixes#5251
Signed-off-by: Mohammed Aminu Futa <mohammedfuta2000@gmail.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
The test checked for the exact BusyBox ash diagnostic "sh: can't fork".
With BusyBox 1.38, ash reports the failure as:
/bin/sh: line 0: can't fork: Resource temporarily unavailable
Match the stable "can't fork" part of the error message instead.
Signed-off-by: Ricardo Branco <rbranco@suse.de>
TestPids used long hand-written /bin/true pipelines for the 4-, 32- and
64-command cases. This made the test easy to typo and hard to review, as
seen by the earlier "bin/true" entries.
Build the shell pipelines instead, preserving the existing test coverage
while making the command counts explicit.
Signed-off-by: Ricardo Branco <rbranco@suse.de>
Close the root file descriptor immediately after use in maskPaths to
reduce the window during which an attacker could potentially exploit
an open fd to access or manipulate the root filesystem. This follows
the principle of least privilege and mitigates risks in compromised
or malicious container scenarios.
Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
This is a follow-up to #5275. That change reused a single tmpfs mount
to mask multiple directories, which is efficient when masking more than
one path. However, it introduced unnecessary overhead when only one
directory is masked. This commit restores the original behavior for the
single-path case while preserving shared tmpfs logic for multiple paths.
Signed-off-by: lifubang <lifubang@acmcoder.com>
Switch from sync.Once to sync.OnceFunc and sync.OnceValues.
Keep Root a function (rather than a variable) because godoc
renders function doc better than a variable doc.
Switch to using internal function root internally.
Modify tests accordingly (and simplify NewIntelRdtTestUtil to
fakeRoot).
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The whole struct intelRdtStatus with its methods and a sync.Once is not
needed, since intelrtd.Is*Enabled methods are already run-once (or use
run-once and a simple comparison).
Yet it is still needed for the test to fake values returned by *Enabled.
Simplify to use func pointers which a test case overwrites.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
It is not doing anything, and tests can just instantiate the &Manager{}.
Suggested-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
1. Use sync.OnceValue.
2. Fix the len(buf) check -- we only need 1 byte. Real kernel output
is "Y\n" so practically this change is a no-op.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Kubernetes may add one sysfs thermal_throttle entry per CPU to
maskedPaths. On large Intel systems this can produce many directory
masks for a single container. runc currently handles each directory
mask with a separate read-only tmpfs mount, and therefore a separate
tmpfs superblock.
On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs
superblocks can contend on the global shrinker_rwsem when containers
start or stop concurrently.
Use one read-only tmpfs for directory masks and bind-mount it over the
remaining directory targets. The first non-procfs-fd directory mount is
reopened through the container root fd before it is reused. File masks
still bind /dev/null, and procfs fd targets keep the existing
one-tmpfs-per-target behaviour because they are fd aliases rather than
stable rootfs paths.
If the bind-mount of the shared source fails (e.g. due to kernel
restrictions), fall back to individual tmpfs mounts for all remaining
directories. Tmpfs mounts use nr_blocks=1,nr_inodes=1 to minimise
kernel resource usage.
The bind mounts do not create additional tmpfs superblocks. They also
retain the read-only mount flag inherited from the source vfsmount, so
the masking semantics remain unchanged.
xref: kubernetes/kubernetes#138512
xref: kubernetes/kubernetes#138388
xref: kubernetes/kubernetes#131018
Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Refactored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
Previously, masked directories (e.g., /proc/acpi, /proc/scsi) were
mounted as read-only tmpfs without explicit size or inode limits.
Although these mounts are meant to be empty and unwritable, the lack
of resource constraints means that—should an attacker bypass the
read-only protection (e.g., via container escape, mount namespace
manipulation, or a kernel vulnerability)—the tmpfs could consume up
to 50% of system memory by default (the kernel's default tmpfs limit).
To mitigate this risk in high-density container environments and
adhere to the principle of least privilege, we now explicitly set:
- nr_blocks=1 (sufficient for at most one block size)
- nr_inodes=1 (sufficient for at most one inode)
Ref: https://man7.org/linux/man-pages/man5/tmpfs.5.html
These limits ensure that even if compromised, kernel memory usage
remains strictly bounded and negligible.
This change aligns with best practices used by other container
runtimes and strengthens defense-in-depth for sensitive masked paths.
Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Refactored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
Before commit 7dc24868, when process.env was nil, prepareEnv
returned a flag telling HOME is not set, and it was added.
Commit 7dc24868 moved the functionality of adding HOME into
prepareEnv but did not properly handle nil case. As a result,
runc exec -p with process.json having no env set resulted in
an exec with no HOME set.
Fix this, and add unit and integration tests.
Fixes: 7dc24868 ("libct: switch to numeric UID/GID/groups")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since commit 3cdda46 the poststart hooks runs after the container
process start, and so they race.
Move the poststart hook check to a separate step after the container
process has exited.
Fixes 5245.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This test is racy for a long time now. All the logs I could find in CI
seem to be dangling symlinks, like the test shows "23 -> ". This means
the fd was closed before we did the call to readlink().
Let's try to disable the GC. This should get rid of the "fds are getting
closed before we read them" part.
Updates: #4297
Signed-off-by: Rodrigo Campos <rodrigo@amutable.com>
By default, readlink is silent about any errors. Make it verbose so we
can better interpret any test failures.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When running a process inside a container, make sure its stderr is not
nil (except for some trivial cases like cat). Modify waitProcess to show
failed command's stderr, if possible.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Since Wait returns an ExitError if process' exit status is not 0,
checking process status is redundant and this code is never reached.
Remove it.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When rootfsPropagation is set to rslave, prepareRoot() was forcing the
rootfs parent mount to MS_PRIVATE before bind-mounting and pivoting into
the rootfs. That breaks the slave relationship needed for HostToContainer
propagation, so later unmount/remount events on host mountpoints under
the rootfs are not reflected inside the running container.
Fix this by keeping the rootfs parent mount as MS_SLAVE for slave-like
rootfs propagation settings, while leaving the final root propagation
remount in place.
Signed-off-by: sean <xujihui1985@gmail.com>
These helpers all make more sense as a self-contained package and moving
them has the added benefit of removing an unneeded libpathrs dependency
(from libcontainer/utils's import of pathrs-lite) from several test
binaries.
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
This allows users to automaticaly migrate to the new location
using `go fix`. It has some limitations, but can help smoothen
the transition; for example, taking this file;
```
package main
import (
"github.com/opencontainers/runc/libcontainer/devices"
)
func main() {
_, _ = devices.DeviceFromPath("a", "b")
_, _ = devices.HostDevices()
_, _ = devices.GetDevices("a")
}
```
Running `go fix -mod=readonly ./...` will migrate the code;
```
package main
import (
devices0 "github.com/moby/sys/devices"
)
func main() {
_, _ = devices0.DeviceFromPath("a", "b")
_, _ = devices0.HostDevices()
_, _ = devices0.GetDevices("a")
}
```
updates b345c78dca
Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
The runtime-spec [1] currently says:
> 6. Runtime's start command is invoked with the unique identifier of
> the container.
> 7. The startContainer hooks MUST be invoked by the runtime. If any
> startContainer hook fails, the runtime MUST generate an error, stop
> the container, and continue the lifecycle at step 12.
> 8. The runtime MUST run the user-specified program, as specified by
> process.
> 9. The poststart hooks MUST be invoked by the runtime. If any
> poststart hook fails, the runtime MUST generate an error, stop the
> container, and continue the lifecycle at step 12.
> ...
> 11. Runtime's delete command is invoked with the unique identifier of
> the container.
> 12. The container MUST be destroyed by undoing the steps performed
> during create phase (step 2).
> 13. The poststop hooks MUST be invoked by the runtime. If any poststop
> hook fails, the runtime MUST log a warning, but the remaining hooks
> and lifecycle continue as if the hook had succeeded.
Currently, we do 9 before 8 (heck, even before 6), which is clearly
against the spec and results in issues like the one described in [2].
Let's move running poststart hook to after the user-specified process
has started.
NOTE this patch only fixes the order and does not implement removing
the container when the poststart hook failed (as this part of the spec
is controversial -- destroy et al and should probably be, and currently
are, part of "runc delete").
[1]: https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#lifecycle
[2]: https://github.com/opencontainers/runc/issues/5182
Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Reported-by: Erik Sjölund <erik.sjolund@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Rename c.signal to c.signalInit, and add c.signal which is a lock-less
form of c.Signal.
To be used by the next patch.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
The libcontainer/devices package has been moved to moby/sys/devices, so
we can just point users to that and keep some compatibility shims around
until runc 1.6. We don't use it at all so there are no other changes
needed.
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
This uses preopened rootfs in Chdir and pivotRoot.
While at it, add O_PATH when opening oldroot in pivotRoot.
Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
A lot of filesystem-related stuff happens inside the container root
directory, and we have used its name before. It makes sense to pre-open
it and use a *os.File handle instead.
Function names in internal/pathrs are kept as is for simplicity (and it
is an internal package), but they now accept root as *os.File.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Indeed, it does not make sense to prepend c.root once we started using
MkdirAllInRoot in commit 63c29081.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Li Fubang (3):
test: check mount source fds are cleaned up with idmapped mounts
libct: close mount source fd as soon as possible
libct: add a nil check for mountError
LGTMs: kolyshkin rata cyphar
This commit factors out setupAndMountToRootfs without changing any
logic. Use "Hide whitespace changes" during review to focus on the
actual changes.
The refactor ensures the mount source file descriptor is closed via
defer in each loop iteration, reducing the total number of open FDs
in runc. This helps avoid hitting the file descriptor limit under
high concurrency or when handling many mounts.
Signed-off-by: lifubang <lifubang@acmcoder.com>