3093 Commits

Author SHA1 Message Date
Kir Kolyshkin
3a125a799d Merge pull request #5271 from captainmo1/5251-simplify-exec-fifo-wait
libct: simplify exec fifo wait using poll(2)
2026-06-23 11:23:58 -07:00
Rodrigo Campos Catelin
c63f70f883 Merge pull request #5318 from xujihui1985/fix/checkpoint-cgroup2-mount-options
ci: workaround to avoid mutating cgroupv2 mount options
2026-06-23 14:45:03 +02:00
sean
3805b01e8a ci(checkpoint): workaround to avoid mutating cgroupv2 mount options
Add --manage-cgroups-mode ignore to avoid polluting cgroupv2 mount options
during unit and integration tests.
https://github.com/checkpoint-restore/criu/issues/3029

Signed-off-by: sean <xujihui1985@gmail.com>
2026-06-23 18:59:03 +08:00
Kir Kolyshkin
f66ace4cfa deps: bump to go-criu v8.3.0
go-criu v8.3.0 switches to protobuf-go-lite, which helps to remove
google.golang.org/protobuf dependency from here, reducing the runc
binary size from ~16M to ~14M.

The only missing piece is proto.String, proto.Bool, proto.Int32 etc.
helpers that return a pointer to a given variable. Those are replaced
by a generic mkPtr, which in turn is to be replaced by the new builtin
once Go < 1.26 is no longer supported.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-06-15 12:09:36 -07:00
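
The generic mkPtr helper named in commit f66ace4cfa can be sketched as follows. Only the name mkPtr comes from the commit message; the body and the options struct are assumptions made for illustration, not the code in the tree.

```go
package main

import "fmt"

// mkPtr returns a pointer to a copy of v. The name is taken from the
// commit message; this body is an assumed implementation.
func mkPtr[T any](v T) *T {
	return &v
}

// options is a hypothetical stand-in for a generated message type whose
// optional fields are pointers.
type options struct {
	ImagesDir    *string
	LeaveRunning *bool
	LogLevel     *int32
}

func main() {
	opts := options{
		ImagesDir:    mkPtr("/tmp/images"),
		LeaveRunning: mkPtr(true),
		LogLevel:     mkPtr(int32(4)),
	}
	fmt.Println(*opts.ImagesDir, *opts.LeaveRunning, *opts.LogLevel)
}
```

A single generic function covers what previously needed one helper per type (proto.String, proto.Bool, proto.Int32 and so on).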
Aleksa Sarai
66acd48f9d rootfs: make cgroupv1 subsystem symlinks fd-based
As with /dev symlinks, this was missed in commit d40b3439a9 ("rootfs:
switch to fd-based handling of mountpoint targets"). It's not really
clear to what extent this was exploitable (/sys/fs/cgroup is a tmpfs we
create) but it's better to just fix this anyway.

Fixes: d40b3439a9 ("rootfs: switch to fd-based handling of mountpoint targets")
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
2026-06-13 00:26:52 +02:00
Aleksa Sarai
864db8042d rootfs: make /dev initialisation code fd-based
These codepaths are very old and operate on pure paths but before
pivot_root(2), meaning that a bad image with a malicious /dev symlink
could cause us to operate on host paths instead.

In practice this means that we could be tricked into removing a file
called "ptmx" (note that /dev/pts/ptmx and /dev/ptmx are both immune for
different reasons) or creating a very restricted set of symlinks (with
fixed targets and names). The scope of these bugs is thus quite limited,
but we definitely need to harden against it.

These codepaths were unfortunately missed during the fd-based rework in
commit d40b3439a9 ("rootfs: switch to fd-based handling of mountpoint
targets") -- I must've assumed they were called after pivot_root(2)...

Fixes: GHSA-xjvp-4fhw-gc47
Fixes: CVE-2026-41579
Fixes: d40b3439a9 ("rootfs: switch to fd-based handling of mountpoint targets")
Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
2026-06-12 18:12:37 +02:00
Aleksa Sarai
fcf04eb41b rootfs: switch createDevices argument order
This argument order matches most other helpers we have and will also
match the changes we are about to make to setupPtmx and
setupDevSymlinks.

Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
2026-06-12 18:12:37 +02:00
Mohammed Aminu Futa
937d887d1c libct: simplify exec fifo wait using poll(2)
Replace the goroutine + channel + 100ms time.After + blocking open
in handleFifo with a poll(2) loop on a non-blocking open. Use
pidfd_open(2) where available to wait for init exit without timeout,
falling back to /proc state checks with 100ms timeout on older
kernels.

Fixes #5251

Signed-off-by: Mohammed Aminu Futa <mohammedfuta2000@gmail.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-06-06 00:55:14 +00:00
Kir Kolyshkin
269405107f deps: bump go-criu to v8.2.0
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-06-04 10:48:15 -07:00
Patrick Stoeckle
e44aa440d9 chore: fix some typos in comments
Signed-off-by: Patrick Stoeckle <patrick.stoeckle@siemens.com>
2026-05-27 13:49:23 +02:00
Ricardo Branco
de39d5e79b tests/int: relax testPids fork error match string
The test checked for the exact BusyBox ash diagnostic "sh: can't fork".
With BusyBox 1.38, ash reports the failure as:

  /bin/sh: line 0: can't fork: Resource temporarily unavailable

Match the stable "can't fork" part of the error message instead.

Signed-off-by: Ricardo Branco <rbranco@suse.de>
2026-05-25 21:52:19 +02:00
Ricardo Branco
3acb097f93 tests/int: build TestPids pipelines programmatically
TestPids used long hand-written /bin/true pipelines for the 4-, 32- and
64-command cases. This made the test easy to typo and hard to review, as
seen by the earlier "bin/true" entries.

Build the shell pipelines instead, preserving the existing test coverage
while making the command counts explicit.

Signed-off-by: Ricardo Branco <rbranco@suse.de>
2026-05-25 21:52:19 +02:00
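
Building the pipelines for commit 3acb097f93 programmatically could look roughly like this. truePipeline is a hypothetical helper name; the actual test may construct the command differently.

```go
package main

import (
	"fmt"
	"strings"
)

// truePipeline returns a shell pipeline made of n "/bin/true" commands
// joined by " | ". The helper name is hypothetical.
func truePipeline(n int) string {
	cmds := make([]string, n)
	for i := range cmds {
		cmds[i] = "/bin/true"
	}
	return strings.Join(cmds, " | ")
}

func main() {
	// The command counts mentioned in the commit message.
	for _, n := range []int{4, 32, 64} {
		p := truePipeline(n)
		fmt.Printf("%d commands, %d bytes\n", n, len(p))
	}
}
```

Because every element comes from the same literal, a typo such as "bin/true" cannot appear in only one position, and the count is visible at the call site.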
Kir Kolyshkin
84762a5c1a Merge pull request #5285 from lifubang/followup-5275-maskpath
libct: Clean up and refactor maskPaths logic
2026-05-18 11:13:16 -07:00
lifubang
b88635e57e libct: close rootFd ASAP in maskPaths
Close the root file descriptor immediately after use in maskPaths to
reduce the window during which an attacker could potentially exploit
an open fd to access or manipulate the root filesystem. This follows
the principle of least privilege and mitigates risks in compromised
or malicious container scenarios.

Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-05-17 02:26:10 +00:00
lifubang
e7e2f00248 libct: optimize maskPaths for single-directory case
This is a follow-up to #5275. That change reused a single tmpfs mount
to mask multiple directories, which is efficient when masking more than
one path. However, it introduced unnecessary overhead when only one
directory is masked. This commit restores the original behavior for the
single-path case while preserving shared tmpfs logic for multiple paths.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-05-16 05:50:01 +00:00
Kir Kolyshkin
16dde3befc libct/intelrdt: use sync.OnceFunc and sync.OnceValues
Switch from sync.Once to sync.OnceFunc and sync.OnceValues.

Keep Root a function (rather than a variable) because godoc
renders function doc better than a variable doc.

Switch to using internal function root internally.

Modify tests accordingly (and simplify NewIntelRdtTestUtil to
fakeRoot).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-05-15 17:14:15 -07:00
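
The pattern described in commit 16dde3befc (an exported function wrapping an internal run-once value) can be sketched as below. The lookup body, the returned path, and the lookups counter are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"sync"
)

// lookups counts how often the expensive lookup actually runs.
var lookups int

// root caches the result of the first call. A real implementation would
// locate the mount point; here the value is a fixed, assumed path.
var root = sync.OnceValues(func() (string, error) {
	lookups++
	return "/sys/fs/resctrl", nil
})

// Root returns the cached mount point. It stays a function, rather than
// an exported variable, so its documentation renders as function docs.
func Root() (string, error) {
	return root()
}

// warnOnce shows sync.OnceFunc for a side effect with no return value.
var warnOnce = sync.OnceFunc(func() {
	fmt.Println("warning: printed at most once")
})

func main() {
	for i := 0; i < 3; i++ {
		dir, err := Root()
		warnOnce()
		fmt.Println(dir, err, "lookups:", lookups)
	}
}
```

Compared with a hand-rolled sync.Once plus package-level result variables, the cached values live inside the closure and cannot be read before initialisation.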
Kir Kolyshkin
2d2ae8809c libct/configs/validate: simplify intelrdt tests
The whole struct intelRdtStatus with its methods and a sync.Once is not
needed, since intelrdt.Is*Enabled methods are already run-once (or use
run-once and a simple comparison).

Yet it is still needed for the test to fake values returned by *Enabled.

Simplify to use func pointers which a test case overwrites.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-05-15 17:14:15 -07:00
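
The "func pointers which a test case overwrites" approach from commit 2d2ae8809c might look like the sketch below. The names isCATEnabled and checkConfig are hypothetical, and the real validation logic is more involved.

```go
package main

import (
	"errors"
	"fmt"
)

// isCATEnabled is a package-level function variable. Production code
// would point it at the real run-once check; a test can replace it.
// The name is hypothetical.
var isCATEnabled = func() bool { return false }

// checkConfig is a hypothetical validator that consults the hook.
func checkConfig(wantCAT bool) error {
	if !wantCAT {
		// Early return: nothing to validate.
		return nil
	}
	if !isCATEnabled() {
		return errors.New("CAT requested but not enabled")
	}
	return nil
}

func main() {
	fmt.Println("default:", checkConfig(true))

	// What a test case would do: fake the value, then restore it.
	old := isCATEnabled
	isCATEnabled = func() bool { return true }
	fmt.Println("faked:", checkConfig(true))
	isCATEnabled = old
}
```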
Kir Kolyshkin
5cd0cb6d51 libct/intelrdt: remove newManager
It is not doing anything, and tests can just instantiate the &Manager{}.

Suggested-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-05-15 17:14:15 -07:00
Kir Kolyshkin
48c7e83b91 libcontainer/configs/validate: use early return
...in intelrdtCheck, like all other checks already do.

Best reviewed with --ignore-all-space.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-05-15 17:14:15 -07:00
Kir Kolyshkin
8d1ebab374 libct/utils: use sync.OnceValue
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-05-15 17:14:15 -07:00
Kir Kolyshkin
2ae07a45d6 libct/apparmor: simplify isEnabled
1. Use sync.OnceValue.

2. Fix the len(buf) check -- we only need 1 byte. Real kernel output
   is "Y\n" so practically this change is a no-op.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-05-15 17:14:15 -07:00
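
A rough sketch of the two points in commit 2ae07a45d6: a run-once check built with sync.OnceValue, and a length test that needs only the first byte. The parseEnabled helper and the sysfs path are assumptions, not a copy of the real code.

```go
package main

import (
	"fmt"
	"os"
	"sync"
)

// parseEnabled reports whether the kernel parameter reads as enabled.
// Only the first byte matters, so "Y" and "Y\n" are treated the same.
// The helper name is hypothetical.
func parseEnabled(buf []byte) bool {
	return len(buf) > 0 && buf[0] == 'Y'
}

// isEnabled reads the parameter once and caches the answer. The path is
// assumed for this sketch.
var isEnabled = sync.OnceValue(func() bool {
	buf, err := os.ReadFile("/sys/module/apparmor/parameters/enabled")
	return err == nil && parseEnabled(buf)
})

func main() {
	fmt.Println("enabled:", isEnabled())
}
```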
lifubang
c046c9b973 libct: reuse tmpfs for directory masks
Kubernetes may add one sysfs thermal_throttle entry per CPU to
maskedPaths. On large Intel systems this can produce many directory
masks for a single container. runc currently handles each directory
mask with a separate read-only tmpfs mount, and therefore a separate
tmpfs superblock.

On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs
superblocks can contend on the global shrinker_rwsem when containers
start or stop concurrently.

Use one read-only tmpfs for directory masks and bind-mount it over the
remaining directory targets. The first non-procfs-fd directory mount is
reopened through the container root fd before it is reused. File masks
still bind /dev/null, and procfs fd targets keep the existing
one-tmpfs-per-target behaviour because they are fd aliases rather than
stable rootfs paths.

If the bind-mount of the shared source fails (e.g. due to kernel
restrictions), fall back to individual tmpfs mounts for all remaining
directories. Tmpfs mounts use nr_blocks=1,nr_inodes=1 to minimise
kernel resource usage.

The bind mounts do not create additional tmpfs superblocks. They also
retain the read-only mount flag inherited from the source vfsmount, so
the masking semantics remain unchanged.

xref: kubernetes/kubernetes#138512
xref: kubernetes/kubernetes#138388
xref: kubernetes/kubernetes#131018

Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Refactored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-05-13 13:05:32 +08:00
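
The decision logic described in commit c046c9b973 can be summarised as a small planning function. This is only a model under stated assumptions: all type and function names are hypothetical, no mount is performed, and the fallback to individual tmpfs mounts when the bind-mount fails is left out.

```go
package main

import "fmt"

// action describes how one masked path would be handled (hypothetical).
type action string

const (
	bindDevNull action = "bind /dev/null"
	mountTmpfs  action = "mount new read-only tmpfs"
	bindShared  action = "bind-mount shared tmpfs"
)

// target is a hypothetical description of one masked path.
type target struct {
	path     string
	isDir    bool
	procfsFd bool // reached through a procfs fd alias
}

// planMasks returns one action per target: files bind /dev/null, procfs
// fd directories each get their own tmpfs, the first ordinary directory
// gets a tmpfs that becomes the shared source, and later ordinary
// directories bind-mount that source.
func planMasks(targets []target) []action {
	plan := make([]action, 0, len(targets))
	haveShared := false
	for _, t := range targets {
		switch {
		case !t.isDir:
			plan = append(plan, bindDevNull)
		case t.procfsFd:
			plan = append(plan, mountTmpfs)
		case !haveShared:
			plan = append(plan, mountTmpfs)
			haveShared = true
		default:
			plan = append(plan, bindShared)
		}
	}
	return plan
}

func main() {
	// Illustrative paths only.
	targets := []target{
		{path: "/masked/file"},
		{path: "/masked/dir1", isDir: true},
		{path: "/masked/dir2", isDir: true},
	}
	for i, a := range planMasks(targets) {
		fmt.Printf("%s: %s\n", targets[i].path, a)
	}
}
```

With many directory targets, only the first one in this model creates a tmpfs superblock, which is the property the commit message relies on.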
lifubang
e57a7a4c8f libct: enforce strict tmpfs limits for masked paths
Previously, masked directories (e.g., /proc/acpi, /proc/scsi) were
mounted as read-only tmpfs without explicit size or inode limits.
Although these mounts are meant to be empty and unwritable, the lack
of resource constraints means that—should an attacker bypass the
read-only protection (e.g., via container escape, mount namespace
manipulation, or a kernel vulnerability)—the tmpfs could consume up
to 50% of system memory by default (the kernel's default tmpfs limit).

To mitigate this risk in high-density container environments and
adhere to the principle of least privilege, we now explicitly set:
  - nr_blocks=1 (limits the mount to a single block)
  - nr_inodes=1 (limits the mount to a single inode)
Ref: https://man7.org/linux/man-pages/man5/tmpfs.5.html

These limits ensure that even if compromised, kernel memory usage
remains strictly bounded and negligible.

This change aligns with best practices used by other container
runtimes and strengthens defense-in-depth for sensitive masked paths.

Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Refactored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-05-13 13:05:32 +08:00
lifubang
abf70bab63 libct: skip mount for duplicate masked paths
Co-authored-by: Davanum Srinivas <davanum@gmail.com>
Refactored-by: lifubang <lifubang@acmcoder.com>
Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-05-13 13:05:32 +08:00
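
Commit abf70bab63 has no body, so the following is only a guess at the general idea: drop repeated entries before mounting. uniquePaths is a hypothetical helper that compares paths as exact strings; the real change may normalise paths or detect duplicates differently.

```go
package main

import "fmt"

// uniquePaths returns paths with exact-string duplicates removed,
// keeping the first occurrence and the original order.
func uniquePaths(paths []string) []string {
	seen := make(map[string]struct{}, len(paths))
	out := make([]string, 0, len(paths))
	for _, p := range paths {
		if _, ok := seen[p]; ok {
			continue // already handled, skip the extra mount
		}
		seen[p] = struct{}{}
		out = append(out, p)
	}
	return out
}

func main() {
	masked := []string{"/proc/acpi", "/proc/scsi", "/proc/acpi"}
	fmt.Println(uniquePaths(masked))
}
```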
Kir Kolyshkin
321073efde runc exec -p: fix adding HOME to nil env
Before commit 7dc24868, when process.env was nil, prepareEnv
returned a flag indicating that HOME was not set, and HOME was added.

Commit 7dc24868 moved the functionality of adding HOME into
prepareEnv but did not properly handle the nil case. As a result,
runc exec -p with a process.json that has no env set resulted in
an exec with no HOME set.

Fix this, and add unit and integration tests.

Fixes: 7dc24868 ("libct: switch to numeric UID/GID/groups")
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-29 23:15:18 -07:00
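
The nil-env case from commit 321073efde can be illustrated with a simplified stand-in. The signature below is an assumption and differs from the real prepareEnv; the point is only that a nil slice must still end up with HOME appended.

```go
package main

import (
	"fmt"
	"strings"
)

// prepareEnv is a simplified, hypothetical stand-in: it returns env with
// HOME=home appended unless HOME is already present. A nil env is
// handled the same way as an empty one.
func prepareEnv(env []string, home string) []string {
	for _, kv := range env {
		if strings.HasPrefix(kv, "HOME=") {
			return env
		}
	}
	// The three-index slice forces append to allocate, so the caller's
	// backing array is never modified. For a nil env this is a no-op.
	return append(env[:len(env):len(env)], "HOME="+home)
}

func main() {
	fmt.Println(prepareEnv(nil, "/root"))
	fmt.Println(prepareEnv([]string{"PATH=/bin"}, "/root"))
	fmt.Println(prepareEnv([]string{"HOME=/home/user"}, "/root"))
}
```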
Kir Kolyshkin
1d12f98f85 tests/int: fix TestHook flakiness
Since commit 3cdda46, the poststart hooks run after the container
process starts, and so they race.

Move the poststart hook check to a separate step after the container
process has exited.

Fixes 5245.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-20 16:35:49 -07:00
Kir Kolyshkin
905958ea65 tests/int: show stderr if command failed part II
This adds a few cases missed by commit bf4fcc30.

Fixes: bf4fcc30
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-18 15:43:03 -07:00
Rodrigo Campos
748af2e285 libct/test: Disable GC on test run to catch leaking fds
This test has been racy for a long time now. All the logs I could find in CI
seem to be dangling symlinks, like the test shows "23 -> ". This means
the fd was closed before we did the call to readlink().

Let's try to disable the GC. This should get rid of the "fds are getting
closed before we read them" part.

Updates: #4297

Signed-off-by: Rodrigo Campos <rodrigo@amutable.com>
2026-04-15 17:08:29 -07:00
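
Disabling the collector for the duration of a check, as commit 748af2e285 proposes, can be done with runtime/debug. withGCDisabled is a hypothetical helper; the actual test may simply call debug.SetGCPercent directly.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// withGCDisabled runs fn with the garbage collector turned off and then
// restores the previous GC percentage. With the GC off, unreferenced
// *os.File values are not finalized (and their fds not closed) while fn
// is inspecting the fd table.
func withGCDisabled(fn func()) {
	old := debug.SetGCPercent(-1)
	defer debug.SetGCPercent(old)
	fn()
}

func main() {
	withGCDisabled(func() {
		fmt.Println("inspecting fds with the GC disabled")
	})
}
```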
Kir Kolyshkin
9970cbfdb6 libct/int: switch from bytes.Buffer to strings.Builder
The latter is simpler and provides just enough functionality to be used
here.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-14 17:05:06 -07:00
Kir Kolyshkin
568a309225 libct/int: remove buffers.Stdin
It is never used.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-14 17:05:06 -07:00
Kir Kolyshkin
54be90bf68 libct/int: use readlink -v
By default, readlink is silent about any errors. Make it verbose so we
can better interpret any test failures.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-14 17:05:05 -07:00
Kir Kolyshkin
bf4fcc3002 libct/int: show stderr if command failed
When running a process inside a container, make sure its stderr is not
nil (except for some trivial cases like cat). Modify waitProcess to show
failed command's stderr, if possible.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-14 17:04:18 -07:00
Kir Kolyshkin
dd9fda7d60 libct/int: waitProcess: rm dead code
Since Wait returns an ExitError if process' exit status is not 0,
checking process status is redundant and this code is never reached.

Remove it.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-14 17:03:56 -07:00
sean
ec170d8672 fix(libcontainer): preserve rootfs slave propagation
When rootfsPropagation is set to rslave, prepareRoot() was forcing the
rootfs parent mount to MS_PRIVATE before bind-mounting and pivoting into
the rootfs. That breaks the slave relationship needed for HostToContainer
propagation, so later unmount/remount events on host mountpoints under
the rootfs are not reflected inside the running container.

Fix this by keeping the rootfs parent mount as MS_SLAVE for slave-like
rootfs propagation settings, while leaving the final root propagation
remount in place.

Signed-off-by: sean <xujihui1985@gmail.com>
2026-04-11 10:22:16 +08:00
Rodrigo Campos Catelin
d57a45eb78 Merge pull request #5227 from cyphar/internal-cmsg-package
libct: move cmsg helpers to new internal/cmsg package
2026-04-08 11:36:32 +02:00
Rodrigo Campos Catelin
4c8d72d54d Merge pull request #5186 from kolyshkin/poststart
Move poststart hook from runc create to runc start
2026-04-08 11:35:17 +02:00
Aleksa Sarai
ca509e76ff libct: move cmsg helpers to new internal/cmsg package
These helpers all make more sense as a self-contained package and moving
them has the added benefit of removing an unneeded libpathrs dependency
(from libcontainer/utils's import of pathrs-lite) from several test
binaries.

Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
2026-04-08 01:21:41 +10:00
Sebastiaan van Stijn
ba83c7c7d7 libcontainer/devices: add '//go:fix inline' directives
This allows users to automatically migrate to the new location
using `go fix`. It has some limitations, but can help smooth
the transition. For example, take this file:

```
package main

import (
	"github.com/opencontainers/runc/libcontainer/devices"
)

func main() {
	_, _ = devices.DeviceFromPath("a", "b")
	_, _ = devices.HostDevices()
	_, _ = devices.GetDevices("a")
}
```

Running `go fix -mod=readonly ./...` will migrate the code;

```
package main

import (
	devices0 "github.com/moby/sys/devices"
)

func main() {
	_, _ = devices0.DeviceFromPath("a", "b")
	_, _ = devices0.HostDevices()
	_, _ = devices0.GetDevices("a")
}
```

updates b345c78dca

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
2026-04-04 19:36:43 +02:00
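
On the shim side, a deprecated wrapper carrying the directive from commit ba83c7c7d7 might look like the sketch below. To stay self-contained it forwards to a local function (newHostDevices, a hypothetical name) instead of another package.

```go
package main

import "fmt"

// newHostDevices stands in for the function at the new location. In a
// real shim this would live in a different package.
func newHostDevices() []string {
	return []string{"/dev/null"}
}

// HostDevices is kept for compatibility.
//
// Deprecated: use newHostDevices instead.
//
//go:fix inline
func HostDevices() []string {
	return newHostDevices()
}

func main() {
	fmt.Println(HostDevices())
}
```

The directive marks the wrapper as safe to replace with its body, so tooling can rewrite callers to call the new function directly.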
Kir Kolyshkin
3cdda464fa Move poststart hook from runc create to runc start
The runtime-spec [1] currently says:

> 6. Runtime's start command is invoked with the unique identifier of
>    the container.
> 7. The startContainer hooks MUST be invoked by the runtime. If any
>    startContainer hook fails, the runtime MUST generate an error, stop
>    the container, and continue the lifecycle at step 12.
> 8. The runtime MUST run the user-specified program, as specified by
>    process.
> 9. The poststart hooks MUST be invoked by the runtime. If any
>    poststart hook fails, the runtime MUST generate an error, stop the
>    container, and continue the lifecycle at step 12.
> ...
> 11. Runtime's delete command is invoked with the unique identifier of
>     the container.
> 12. The container MUST be destroyed by undoing the steps performed
>     during create phase (step 2).
> 13. The poststop hooks MUST be invoked by the runtime. If any poststop
>     hook fails, the runtime MUST log a warning, but the remaining hooks
>     and lifecycle continue as if the hook had succeeded.

Currently, we do 9 before 8 (heck, even before 6), which is clearly
against the spec and results in issues like the one described in [2].

Let's move running poststart hook to after the user-specified process
has started.

NOTE this patch only fixes the order and does not implement removing
the container when the poststart hook fails (as this part of the spec
is controversial -- destroy et al should probably be, and currently
are, part of "runc delete").

[1]: https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#lifecycle
[2]: https://github.com/opencontainers/runc/issues/5182

Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn>
Reported-by: Erik Sjölund <erik.sjolund@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-02 12:28:54 -07:00
Kir Kolyshkin
2253475660 libct: factor handleFifo out of c.exec
No functional change. To be used by the next patch.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-02 10:22:22 -07:00
Kir Kolyshkin
b0762c7af1 libct: add lock-less c.signal
Rename c.signal to c.signalInit, and add c.signal which is a lock-less
form of c.Signal.

To be used by the next patch.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-04-02 10:22:22 -07:00
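
The split in commit b0762c7af1 between a locking exported method and a lock-less internal one follows a common Go pattern. The container type, its fields, and the abort method below are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// container is a hypothetical stand-in with a mutex guarding its state.
type container struct {
	mu   sync.Mutex
	sent []int // signals delivered so far, for illustration
}

// Signal is the exported entry point: it takes the lock and delegates.
func (c *container) Signal(sig int) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.signal(sig)
}

// signal is the lock-less form. The caller must already hold c.mu.
func (c *container) signal(sig int) error {
	c.sent = append(c.sent, sig)
	return nil
}

// abort already holds the lock while it works, so it has to use
// c.signal; calling c.Signal here would deadlock on c.mu.
func (c *container) abort() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.signal(9)
}

func main() {
	c := &container{}
	_ = c.Signal(15)
	_ = c.abort()
	fmt.Println(c.sent)
}
```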
Aleksa Sarai
b345c78dca libct/devices: deprecate in favour of moby/sys/devices
The libcontainer/devices package has been moved to moby/sys/devices, so
we can just point users to that and keep some compatibility shims around
until runc 1.6. We don't use it at all so there are no other changes
needed.

Signed-off-by: Aleksa Sarai <aleksa@amutable.com>
2026-04-02 22:54:14 +11:00
lfbzhm
5b094ed1ac libct: use preopened rootfs more
This uses preopened rootfs in Chdir and pivotRoot.

While at it, add O_PATH when opening oldroot in pivotRoot.

Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: lfbzhm <lifubang@acmcoder.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-29 12:02:38 -07:00
Kir Kolyshkin
28cb321887 Pre-open container root directory
A lot of filesystem-related stuff happens inside the container root
directory, and we have used its name before. It makes sense to pre-open
it and use a *os.File handle instead.

Function names in internal/pathrs are kept as is for simplicity (and it
is an internal package), but they now accept root as *os.File.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-29 12:02:36 -07:00
Kir Kolyshkin
78b80677f6 libct: minor refactor in mountToRootfs
No change in functionality, just a preparation for the next patch.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-28 23:48:07 -07:00
Kir Kolyshkin
60352524d3 libct: mountCgroupV1: address TODO
Indeed, it does not make sense to prepend c.root once we started using
MkdirAllInRoot in commit 63c29081.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-28 23:48:07 -07:00
Aleksa Sarai
7b40afb6cc merge #5177 into opencontainers/runc:main
Li Fubang (3):
  test: check mount source fds are cleaned up with idmapped mounts
  libct: close mount source fd as soon as possible
  libct: add a nil check for mountError

LGTMs: kolyshkin rata cyphar
2026-03-28 17:32:21 +11:00
Kir Kolyshkin
f00b2f9fd5 libct/exeseal: drop own F_SEAL_EXEC
Since golang.org/x/sys@v0.22 it is available from unix.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2026-03-20 15:57:55 -07:00
lifubang
c77e71a3e7 libct: close mount source fd as soon as possible
This commit factors out setupAndMountToRootfs without changing any
logic. Use "Hide whitespace changes" during review to focus on the
actual changes.

The refactor ensures the mount source file descriptor is closed via
defer in each loop iteration, reducing the total number of open FDs
in runc. This helps avoid hitting the file descriptor limit under
high concurrency or when handling many mounts.

Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-03-20 01:09:49 +00:00
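
The reasoning in commit c77e71a3e7 rests on how defer is scoped: a deferred call runs when the enclosing function returns, not at the end of a loop iteration. Factoring the loop body into its own function therefore closes each file before the next one is opened. processAll and processOne are hypothetical names for this sketch.

```go
package main

import (
	"fmt"
	"os"
)

// processAll handles each path in its own function call, so at most one
// file is open at any time.
func processAll(paths []string) error {
	for _, p := range paths {
		if err := processOne(p); err != nil {
			return err
		}
	}
	return nil
}

// processOne opens path, uses it, and closes it on return. Had the defer
// been written directly inside the loop in processAll, every file would
// stay open until the whole loop finished.
func processOne(path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	_, err = f.Stat()
	return err
}

func main() {
	if err := processAll([]string{os.TempDir(), "."}); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("all paths processed")
}
```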
lifubang
0d0fd95731 libct: add a nil check for mountError
Signed-off-by: lifubang <lifubang@acmcoder.com>
2026-03-19 15:47:32 +00:00