Files
runc/libcontainer/dmz
Aleksa Sarai 515f09f7b1 dmz: use overlayfs to write-protect /proc/self/exe if possible
Commit b999376fb2 ("nsenter: cloned_binary: remove bindfd logic
entirely") removed the read-only bind-mount logic from our cloned binary
code because it wasn't really safe because a container with
CAP_SYS_ADMIN could remove the MS_RDONLY bit and get write access to
/proc/self/exe (even with user namespaces this could've been an issue
because it's not clear if the flags are locked).

However, copying a binary does seem to have a minor performance impact.
The only way to have no performance impact would be for the kernel to
block these write attempts, but barring that we could try to reduce the
overhead by coming up with a mount that cannot have it's read-only bits
cleared.

The "simplest" solution is to create a temporary overlayfs using
fsopen(2) which uses the directory where runc exists as a lowerdir,
ensuring that the container cannot access the underlying file -- and we
don't have to do any copies.

While fsopen(2) is not free because mount namespace cloning is usually
expensive (and so it seems like the difference would be marginal), some
basic performance testing seems to indicate there is a ~60% improvement
doing it this way and that it has effectively no overhead even when
compared to just using /proc/self/exe directly:

  % hyperfine --warmup 50 \
  >           "./runc-noclone run -b bundle ctr" \
  >           "./runc-overlayfs run -b bundle ctr" \
  >           "./runc-memfd run -b bundle ctr"

  Benchmark 1: ./runc-noclone run -b bundle ctr
    Time (mean ± σ):      13.7 ms ±   0.9 ms    [User: 6.0 ms, System: 10.9 ms]
    Range (min … max):    11.3 ms …  16.1 ms    184 runs

  Benchmark 2: ./runc-overlayfs run -b bundle ctr
    Time (mean ± σ):      13.9 ms ±   0.9 ms    [User: 6.2 ms, System: 10.8 ms]
    Range (min … max):    11.8 ms …  16.0 ms    180 runs

  Benchmark 3: ./runc-memfd run -b bundle ctr
    Time (mean ± σ):      22.6 ms ±   1.3 ms    [User: 5.7 ms, System: 20.7 ms]
    Range (min … max):    19.9 ms …  26.5 ms    114 runs

  Summary
    ./runc-noclone run -b bundle ctr ran
      1.01 ± 0.09 times faster than ./runc-overlayfs run -b bundle ctr
      1.65 ± 0.15 times faster than ./runc-memfd run -b bundle ctr

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
2024-10-20 21:35:09 +11:00
..
2024-01-21 15:35:46 +01:00
2024-06-29 15:45:25 +02:00
2024-03-13 18:18:09 +11:00

Runc-dmz

runc-dmz is a small and very simple binary used to execute the container's entrypoint.

Making it small

To make it small we use the Linux kernel's nolibc include files, so we don't use the libc.

A full cp of it is here in nolibc/, but removing the Makefile that is GPL. DO NOT FORGET to remove the GPL code if updating the nolibc/ directory.

The current version in that folder is from Linux 6.6-rc3 tag (556fb7131e03b0283672fb40f6dc2d151752aaa7).

It also support all the architectures we support in runc.

If the GOARCH we use for compiling doesn't support nolibc, it fallbacks to using the C stdlib.