dilim/runc

mirror of https://github.com/opencontainers/runc.git synced 2026-06-24 08:48:44 +00:00

Author	SHA1	Message	Date
Kir Kolyshkin	3a125a799d	Merge pull request #5271 from captainmo1/5251-simplify-exec-fifo-wait libct: simplify exec fifo wait using poll(2)	2026-06-23 11:23:58 -07:00
Rodrigo Campos Catelin	c63f70f883	Merge pull request #5318 from xujihui1985/fix/checkpoint-cgroup2-mount-options ci: workaround to avoid mutate cgroupv2 mount options	2026-06-23 14:45:03 +02:00
sean	3805b01e8a	ci(checkpoint): workaround to avoid mutate cgroupv2 mount options Add --manage-cgroups-mode ignore to avoid polluting cgroupv2 mount options during unit and integration tests. https://github.com/checkpoint-restore/criu/issues/3029 Signed-off-by: sean <xujihui1985@gmail.com>	2026-06-23 18:59:03 +08:00
Kir Kolyshkin	f66ace4cfa	deps: bump to go-criu v8.3.0 go-criu v8.3.0 switches to protobuf-go-lite, which helps to remove google.golang.org/protobuf dependency from here, reducing the runc binary size from ~16M to ~14M. The only missing piece is proto.String, proto.Bool, proto.Int32 etc. helpers that return a pointer to a given variable. Those are replaced by a generic mkPtr, which in turn is to be replaced by the new builtin once Go < 1.26 is no longer supported. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-06-15 12:09:36 -07:00
Aleksa Sarai	66acd48f9d	rootfs: make cgroupv1 subsystem symlinks fd-based As with /dev symlinks, this was missed in commit `d40b3439a9` ("rootfs: switch to fd-based handling of mountpoint targets"). It's not really clear to what extent this was exploitable (/sys/fs/cgroup is a tmpfs we create) but it's better to just fix this anyway. Fixes: `d40b3439a9` ("rootfs: switch to fd-based handling of mountpoint targets") Signed-off-by: Aleksa Sarai <aleksa@amutable.com>	2026-06-13 00:26:52 +02:00
Aleksa Sarai	864db8042d	rootfs: make /dev initialisation code fd-based These codepaths are very old and operate on pure paths but before pivot_root(2), meaning that a bad image with a malicious /dev symlink could cause us to operate on host paths instead. In practice this means that we could be tricked into removing a file called "ptmx" (note that /dev/pts/ptmx and /dev/ptmx are both immune for different reasons) or creating a very restricted set of symlinks (with fixed targets and names). The scope of these bugs is thus quite limited, but we definitely need to harden against it. These codepaths were unfortunately missed during the fd-based rework in commit `d40b3439a9` ("rootfs: switch to fd-based handling of mountpoint targets") -- I must've assumed they were called after pivot_root(2)... Fixes: GHSA-xjvp-4fhw-gc47 Fixes: CVE-2026-41579 Fixes: `d40b3439a9` ("rootfs: switch to fd-based handling of mountpoint targets") Signed-off-by: Aleksa Sarai <aleksa@amutable.com>	2026-06-12 18:12:37 +02:00
Aleksa Sarai	fcf04eb41b	rootfs: switch createDevices argument order This argument order matches most other helpers we have and will also match the changes we are about to make to setupPtmx and setupDevSymlinks. Signed-off-by: Aleksa Sarai <aleksa@amutable.com>	2026-06-12 18:12:37 +02:00
Mohammed Aminu Futa	937d887d1c	libct: simplify exec fifo wait using poll(2) Replace the goroutine + channel + 100ms time.After + blocking open in handleFifo with a poll(2) loop on a non-blocking open. Use pidfd_open(2) where available to wait for init exit without timeout, falling back to /proc state checks with 100ms timeout on older kernels. Fixes #5251 Signed-off-by: Mohammed Aminu Futa <mohammedfuta2000@gmail.com> Signed-off-by: lifubang <lifubang@acmcoder.com>	2026-06-06 00:55:14 +00:00
Kir Kolyshkin	269405107f	deps: bump go-criu to v8.2.0 Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-06-04 10:48:15 -07:00
Patrick Stoeckle	e44aa440d9	chore: fix some typos in comments Signed-off-by: Patrick Stoeckle <patrick.stoeckle@siemens.com>	2026-05-27 13:49:23 +02:00
Ricardo Branco	de39d5e79b	tests/int: relax testPids fork error match string The test checked for the exact BusyBox ash diagnostic "sh: can't fork". With BusyBox 1.38, ash reports the failure as: /bin/sh: line 0: can't fork: Resource temporarily unavailable Match the stable "can't fork" part of the error message instead. Signed-off-by: Ricardo Branco <rbranco@suse.de>	2026-05-25 21:52:19 +02:00
Ricardo Branco	3acb097f93	tests/int: build TestPids pipelines programmatically TestPids used long hand-written /bin/true pipelines for the 4-, 32- and 64-command cases. This made the test easy to typo and hard to review, as seen by the earlier "bin/true" entries. Build the shell pipelines instead, preserving the existing test coverage while making the command counts explicit. Signed-off-by: Ricardo Branco <rbranco@suse.de>	2026-05-25 21:52:19 +02:00
Kir Kolyshkin	84762a5c1a	Merge pull request #5285 from lifubang/followup-5275-maskpath libct: Clean up and refactor maskPaths logic	2026-05-18 11:13:16 -07:00
lifubang	b88635e57e	libct: close rootFd ASAP in maskPaths Close the root file descriptor immediately after use in maskPaths to reduce the window during which an attacker could potentially exploit an open fd to access or manipulate the root filesystem. This follows the principle of least privilege and mitigates risks in compromised or malicious container scenarios. Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: lifubang <lifubang@acmcoder.com>	2026-05-17 02:26:10 +00:00
lifubang	e7e2f00248	libct: optimize maskPaths for single-directory case This is a follow-up to #5275. That change reused a single tmpfs mount to mask multiple directories, which is efficient when masking more than one path. However, it introduced unnecessary overhead when only one directory is masked. This commit restores the original behavior for the single-path case while preserving shared tmpfs logic for multiple paths. Signed-off-by: lifubang <lifubang@acmcoder.com>	2026-05-16 05:50:01 +00:00
Kir Kolyshkin	16dde3befc	libct/intelrdt: use sync.OnceFunc and sync.OnceValues Switch from sync.Once to sync.OnceFunc and sync.OnceValues. Keep Root a function (rather than a variable) because godoc renders function doc better than a variable doc. Switch to using internal function root internally. Modify tests accordingly (and simplify NewIntelRdtTestUtil to fakeRoot). Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-05-15 17:14:15 -07:00
Kir Kolyshkin	2d2ae8809c	libct/configs/validate: simplify intelrdt tests The whole struct intelRdtStatus with its methods and a sync.Once is not needed, since intelrdt.IsEnabled methods are already run-once (or use run-once and a simple comparison). Yet it is still needed for the test to fake values returned by Enabled. Simplify to use func pointers which a test case overwrites. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-05-15 17:14:15 -07:00
Kir Kolyshkin	5cd0cb6d51	libct/intelrdt: remove newManager It is not doing anything, and tests can just instantiate the &Manager{}. Suggested-by: Sebastiaan van Stijn <github@gone.nl> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-05-15 17:14:15 -07:00
Kir Kolyshkin	48c7e83b91	libcontainer/configs/validate: use early return ...in intelrdtCheck, like all other checks already do. Best reviewed with --ignore-all-space. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-05-15 17:14:15 -07:00
Kir Kolyshkin	8d1ebab374	libct/utils: use sync.OnceValue Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-05-15 17:14:15 -07:00
Kir Kolyshkin	2ae07a45d6	libct/apparmor: simplify isEnabled 1. Use sync.OnceValue. 2. Fix the len(buf) check -- we only need 1 byte. Real kernel output is "Y\n" so practically this change is a no-op. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-05-15 17:14:15 -07:00
lifubang	c046c9b973	libct: reuse tmpfs for directory masks Kubernetes may add one sysfs thermal_throttle entry per CPU to maskedPaths. On large Intel systems this can produce many directory masks for a single container. runc currently handles each directory mask with a separate read-only tmpfs mount, and therefore a separate tmpfs superblock. On Linux 4.18/RHEL 8 kernels, creating and tearing down many tmpfs superblocks can contend on the global shrinker_rwsem when containers start or stop concurrently. Use one read-only tmpfs for directory masks and bind-mount it over the remaining directory targets. The first non-procfs-fd directory mount is reopened through the container root fd before it is reused. File masks still bind /dev/null, and procfs fd targets keep the existing one-tmpfs-per-target behaviour because they are fd aliases rather than stable rootfs paths. If the bind-mount of the shared source fails (e.g. due to kernel restrictions), fall back to individual tmpfs mounts for all remaining directories. Tmpfs mounts use nr_blocks=1,nr_inodes=1 to minimise kernel resource usage. The bind mounts do not create additional tmpfs superblocks. They also retain the read-only mount flag inherited from the source vfsmount, so the masking semantics remain unchanged. xref: kubernetes/kubernetes#138512 xref: kubernetes/kubernetes#138388 xref: kubernetes/kubernetes#131018 Co-authored-by: Davanum Srinivas <davanum@gmail.com> Refactored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com>	2026-05-13 13:05:32 +08:00
lifubang	e57a7a4c8f	libct: enforce strict tmpfs limits for masked paths Previously, masked directories (e.g., /proc/acpi, /proc/scsi) were mounted as read-only tmpfs without explicit size or inode limits. Although these mounts are meant to be empty and unwritable, the lack of resource constraints means that—should an attacker bypass the read-only protection (e.g., via container escape, mount namespace manipulation, or a kernel vulnerability)—the tmpfs could consume up to 50% of system memory by default (the kernel's default tmpfs limit). To mitigate this risk in high-density container environments and adhere to the principle of least privilege, we now explicitly set: - nr_blocks=1 (sufficient for at most one block size) - nr_inodes=1 (sufficient for at most one inode) Ref: https://man7.org/linux/man-pages/man5/tmpfs.5.html These limits ensure that even if compromised, kernel memory usage remains strictly bounded and negligible. This change aligns with best practices used by other container runtimes and strengthens defense-in-depth for sensitive masked paths. Co-authored-by: Davanum Srinivas <davanum@gmail.com> Refactored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com>	2026-05-13 13:05:32 +08:00
lifubang	abf70bab63	libct: skip mount for duplicate masked paths Co-authored-by: Davanum Srinivas <davanum@gmail.com> Refactored-by: lifubang <lifubang@acmcoder.com> Signed-off-by: lifubang <lifubang@acmcoder.com>	2026-05-13 13:05:32 +08:00
Kir Kolyshkin	321073efde	runc exec -p: fix adding HOME to nil env Before commit `7dc24868`, when process.env was nil, prepareEnv returned a flag telling HOME is not set, and it was added. Commit `7dc24868` moved the functionality of adding HOME into prepareEnv but did not properly handle nil case. As a result, runc exec -p with process.json having no env set resulted in an exec with no HOME set. Fix this, and add unit and integration tests. Fixes: `7dc24868` ("libct: switch to numeric UID/GID/groups") Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-29 23:15:18 -07:00
Kir Kolyshkin	1d12f98f85	tests/int: fix TestHook flakiness Since commit `3cdda46` the poststart hooks run after the container process starts, and so they race. Move the poststart hook check to a separate step after the container process has exited. Fixes 5245. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-20 16:35:49 -07:00
Kir Kolyshkin	905958ea65	tests/int: show stderr if command failed part II This adds a few cases missed by commit `bf4fcc30`. Fixes: `bf4fcc30` Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-18 15:43:03 -07:00
Rodrigo Campos	748af2e285	libct/test: Disable GC on test run to catch leaking fds This test is racy for a long time now. All the logs I could find in CI seem to be dangling symlinks, like the test shows "23 -> ". This means the fd was closed before we did the call to readlink(). Let's try to disable the GC. This should get rid of the "fds are getting closed before we read them" part. Updates: #4297 Signed-off-by: Rodrigo Campos <rodrigo@amutable.com>	2026-04-15 17:08:29 -07:00
Kir Kolyshkin	9970cbfdb6	libct/int: switch from bytes.Buffer to strings.Builder The latter is simpler and provides just enough functionality to be used here. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-14 17:05:06 -07:00
Kir Kolyshkin	568a309225	libct/int: remove buffers.Stdin It is never used. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-14 17:05:06 -07:00
Kir Kolyshkin	54be90bf68	libct/int: use readlink -v By default, readlink is silent about any errors. Make it verbose so we can better interpret any test failures. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-14 17:05:05 -07:00
Kir Kolyshkin	bf4fcc3002	libct/int: show stderr if command failed When running a process inside a container, make sure its stderr is not nil (except for some trivial cases like cat). Modify waitProcess to show failed command's stderr, if possible. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-14 17:04:18 -07:00
Kir Kolyshkin	dd9fda7d60	libct/int: waitProcess: rm dead code Since Wait returns an ExitError if process' exit status is not 0, checking process status is redundant and this code is never reached. Remove it. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-14 17:03:56 -07:00
sean	ec170d8672	fix(libcontainer): preserve rootfs slave propagation When rootfsPropagation is set to rslave, prepareRoot() was forcing the rootfs parent mount to MS_PRIVATE before bind-mounting and pivoting into the rootfs. That breaks the slave relationship needed for HostToContainer propagation, so later unmount/remount events on host mountpoints under the rootfs are not reflected inside the running container. Fix this by keeping the rootfs parent mount as MS_SLAVE for slave-like rootfs propagation settings, while leaving the final root propagation remount in place. Signed-off-by: sean <xujihui1985@gmail.com>	2026-04-11 10:22:16 +08:00
Rodrigo Campos Catelin	d57a45eb78	Merge pull request #5227 from cyphar/internal-cmsg-package libct: move cmsg helpers to new internal/cmsg package	2026-04-08 11:36:32 +02:00
Rodrigo Campos Catelin	4c8d72d54d	Merge pull request #5186 from kolyshkin/poststart Move poststart hook from runc create to runc start	2026-04-08 11:35:17 +02:00
Aleksa Sarai	ca509e76ff	libct: move cmsg helpers to new internal/cmsg package These helpers all make more sense as a self-contained package and moving them has the added benefit of removing an unneeded libpathrs dependency (from libcontainer/utils's import of pathrs-lite) from several test binaries. Signed-off-by: Aleksa Sarai <aleksa@amutable.com>	2026-04-08 01:21:41 +10:00
Sebastiaan van Stijn	ba83c7c7d7	libcontainer/devices: add '//go:fix inline' directives This allows users to automatically migrate to the new location using `go fix`. It has some limitations, but can help smoothen the transition; for example, taking this file; ``` package main import ( "github.com/opencontainers/runc/libcontainer/devices" ) func main() { _, _ = devices.DeviceFromPath("a", "b") _, _ = devices.HostDevices() _, _ = devices.GetDevices("a") } ``` Running `go fix -mod=readonly ./...` will migrate the code; ``` package main import ( devices0 "github.com/moby/sys/devices" ) func main() { _, _ = devices0.DeviceFromPath("a", "b") _, _ = devices0.HostDevices() _, _ = devices0.GetDevices("a") } ``` updates `b345c78dca` Signed-off-by: Sebastiaan van Stijn <github@gone.nl>	2026-04-04 19:36:43 +02:00
Kir Kolyshkin	3cdda464fa	Move poststart hook from runc create to runc start The runtime-spec [1] currently says: > 6. Runtime's start command is invoked with the unique identifier of > the container. > 7. The startContainer hooks MUST be invoked by the runtime. If any > startContainer hook fails, the runtime MUST generate an error, stop > the container, and continue the lifecycle at step 12. > 8. The runtime MUST run the user-specified program, as specified by > process. > 9. The poststart hooks MUST be invoked by the runtime. If any > poststart hook fails, the runtime MUST generate an error, stop the > container, and continue the lifecycle at step 12. > ... > 11. Runtime's delete command is invoked with the unique identifier of > the container. > 12. The container MUST be destroyed by undoing the steps performed > during create phase (step 2). > 13. The poststop hooks MUST be invoked by the runtime. If any poststop > hook fails, the runtime MUST log a warning, but the remaining hooks > and lifecycle continue as if the hook had succeeded. Currently, we do 9 before 8 (heck, even before 6), which is clearly against the spec and results in issues like the one described in [2]. Let's move running poststart hook to after the user-specified process has started. NOTE this patch only fixes the order and does not implement removing the container when the poststart hook failed (as this part of the spec is controversial -- destroy et al and should probably be, and currently are, part of "runc delete"). [1]: https://github.com/opencontainers/runtime-spec/blob/main/runtime.md#lifecycle [2]: https://github.com/opencontainers/runc/issues/5182 Reported-by: ningmingxiao <ning.mingxiao@zte.com.cn> Reported-by: Erik Sjölund <erik.sjolund@gmail.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-02 12:28:54 -07:00
Kir Kolyshkin	2253475660	libct: factor handleFifo out of c.exec No functional change. To be used by the next patch. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-02 10:22:22 -07:00
Kir Kolyshkin	b0762c7af1	libct: add lock-less c.signal Rename c.signal to c.signalInit, and add c.signal which is a lock-less form of c.Signal. To be used by the next patch. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-04-02 10:22:22 -07:00
Aleksa Sarai	b345c78dca	libct/devices: deprecate in favour of moby/sys/devices The libcontainer/devices package has been moved to moby/sys/devices, so we can just point users to that and keep some compatibility shims around until runc 1.6. We don't use it at all so there are no other changes needed. Signed-off-by: Aleksa Sarai <aleksa@amutable.com>	2026-04-02 22:54:14 +11:00
lfbzhm	5b094ed1ac	libct: use preopened rootfs more This uses preopened rootfs in Chdir and pivotRoot. While at it, add O_PATH when opening oldroot in pivotRoot. Co-authored-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: lfbzhm <lifubang@acmcoder.com> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-03-29 12:02:38 -07:00
Kir Kolyshkin	28cb321887	Pre-open container root directory A lot of filesystem-related stuff happens inside the container root directory, and we have used its name before. It makes sense to pre-open it and use a os.File handle instead. Function names in internal/pathrs are kept as is for simplicity (and it is an internal package), but they now accept root as os.File. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-03-29 12:02:36 -07:00
Kir Kolyshkin	78b80677f6	libct: minor refactor in mountToRootfs No change in functionality, just a preparation for the next patch. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-03-28 23:48:07 -07:00
Kir Kolyshkin	60352524d3	libct: mountCgroupV1: address TODO Indeed, it does not make sense to prepend c.root once we started using MkdirAllInRoot in commit `63c29081`. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-03-28 23:48:07 -07:00
Aleksa Sarai	7b40afb6cc	merge #5177 into opencontainers/runc:main Li Fubang (3): test: check mount source fds are cleaned up with idmapped mounts libct: close mount source fd as soon as possible libct: add a nil check for mountError LGTMs: kolyshkin rata cyphar	2026-03-28 17:32:21 +11:00
Kir Kolyshkin	f00b2f9fd5	libct/exeseal: drop own F_SEAL_EXEC Since golang.org/x/sys@v0.22 it is available from unix. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>	2026-03-20 15:57:55 -07:00
lifubang	c77e71a3e7	libct: close mount source fd as soon as possible This commit factors out setupAndMountToRootfs without changing any logic. Use "Hide whitespace changes" during review to focus on the actual changes. The refactor ensures the mount source file descriptor is closed via defer in each loop iteration, reducing the total number of open FDs in runc. This helps avoid hitting the file descriptor limit under high concurrency or when handling many mounts. Signed-off-by: lifubang <lifubang@acmcoder.com>	2026-03-20 01:09:49 +00:00
lifubang	0d0fd95731	libct: add a nil check for mountError Signed-off-by: lifubang <lifubang@acmcoder.com>	2026-03-19 15:47:32 +00:00
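
The go-criu v8.3.0 bump (f66ace4cfa) mentions replacing the proto.String, proto.Bool and proto.Int32 helpers with a generic mkPtr. The commit message names the helper but does not show it, so the following is only a minimal sketch of what such a helper could look like; the body and the variable names in main are assumptions, not the code from that commit.

```go
package main

import "fmt"

// mkPtr returns a pointer to a copy of v.
// The name comes from the commit message; this body is an assumed
// implementation, written here only for illustration.
func mkPtr[T any](v T) *T {
	return &v
}

func main() {
	// Hypothetical values standing in for optional message fields,
	// which are typically represented as pointers.
	name := mkPtr("checkpoint")
	leaveRunning := mkPtr(true)
	pid := mkPtr(int32(1234))

	fmt.Println(*name, *leaveRunning, *pid)
}
```

One generic function covers every field type, which is why a single helper can stand in for the whole family of per-type helpers.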

3093 Commits
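
Several commits in this list (16dde3befc, 8d1ebab374, 2ae07a45d6) switch run-once checks from sync.Once to sync.OnceFunc, sync.OnceValue or sync.OnceValues. A rough sketch of the sync.OnceValue pattern follows, assuming Go 1.21 or later. The isEnabled variable, the probes counter and the hard-coded "Y\n" buffer are hypothetical stand-ins; the real runc code reads kernel state instead.

```go
package main

import (
	"fmt"
	"sync"
)

// probes counts how many times the underlying check actually runs.
var probes int

// isEnabled is a hypothetical run-once check. sync.OnceValue wraps the
// function so that it executes on the first call only; later calls
// return the cached result.
var isEnabled = sync.OnceValue(func() bool {
	probes++
	// Stand-in for a kernel parameter file whose content is "Y\n".
	buf := []byte("Y\n")
	// Only the first byte matters, so one byte is enough.
	return len(buf) > 0 && buf[0] == 'Y'
})

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(isEnabled())
	}
	fmt.Println("probes:", probes)
}
```

Compared with a sync.Once plus a separate result variable, the wrapped function carries its own cached value, so callers cannot read the result before the check has run.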