The CRI checkpoint restore path unpacked checkpoint archive/OCI image content
directly into the container's persistent state directory and read files such as
container.log back from it with a symlink-following copy. Checkpoint content is
externally provided, so make restore more defensive about what it unpacks and
how it reads those files back.
Behavior changes:
- Only unpack regular files and directories from the checkpoint archive.
- Unpack checkpoint content into a dedicated <state>/ctrd-restore
  subdirectory created fresh rather than into the state dir itself, so
  checkpoint content cannot collide with containerd's own files (e.g.
  the "status" blob). Restore and cleanup operate on that subdir;
  cleanup is now a single RemoveAll of it.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Allow the last host to retry on transient network errors to increase the
likelihood of the operation succeeding and help reduce flaky tests.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Some proxy stream setup and receive paths still returned raw RPC
status errors while neighboring proxy methods normalized them with
errgrpc.ToNative. This made errdefs checks depend on which proxy API
surfaced the same remote failure.
Normalize event subscription setup and receive errors, and streaming
stream creation errors, while preserving io.EOF for completed receive
streams.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
go1.26.4 includes security fixes to the crypto/x509, mime, and
net/textproto packages, as well as bug fixes to the compiler, the
runtime, the go fix command, and the crypto/fips140 package.
Signed-off-by: Akhil Mohan <akhilerm@gmail.com>
Image config labels are copied onto the container by both the CRI
plugin (BuildLabels) and the client's WithImageConfigLabels option
used by `ctr run`. Labels in the containerd.io/* namespace are
interpreted by containerd itself and labels in the io.cri-containerd*
namespace are interpreted by the CRI plugin. An image config is not a
trusted source for labels in either namespace.
Skip labels in both reserved namespaces when copying labels from an
image config to a container, and warn about each label skipped: an
image that tries to set them may be attempting to alter containerd
behavior. Oversized image labels are already skipped this way by
the CRI plugin.
Labels set explicitly by clients, for example via `ctr run --label`
or in the CRI request, are unaffected.
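The filtering rule can be sketched as below. `copyImageLabels` and `isReserved` are hypothetical names, the prefixes are derived from the two namespaces named above, and the label keys in `main` are only illustrative. This is not the code from BuildLabels or WithImageConfigLabels.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// reservedPrefixes lists the namespaces an image config may not set.
// Assumption: prefix matching as implied by containerd.io/* and io.cri-containerd*.
var reservedPrefixes = []string{"containerd.io/", "io.cri-containerd"}

func isReserved(key string) bool {
	for _, p := range reservedPrefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}

// copyImageLabels returns the labels that may be copied onto the container
// and the sorted keys that were skipped, so the caller can warn about each.
func copyImageLabels(imageLabels map[string]string) (map[string]string, []string) {
	kept := make(map[string]string, len(imageLabels))
	var skipped []string
	for k, v := range imageLabels {
		if isReserved(k) {
			skipped = append(skipped, k)
			continue
		}
		kept[k] = v
	}
	sort.Strings(skipped)
	return kept, skipped
}

func main() {
	kept, skipped := copyImageLabels(map[string]string{
		"maintainer":             "someone",
		"containerd.io/example":  "x",
		"io.cri-containerd.kind": "sandbox",
	})
	for _, k := range skipped {
		fmt.Printf("warning: skipping reserved image label %q\n", k)
	}
	fmt.Println(kept)
}
```

Labels passed explicitly by a client would be merged after this step and never go through the filter.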
Verified with the CRI plugin and with `ctr run` against an image
whose config carries labels like these: the labels are no longer
present on the created container and a warning is logged for each.
Assisted-by: Claude Code
Signed-off-by: Ben Cressey <ben@cressey.org>
Signed-off-by: Samuel Karp <samuelkarp@google.com>
GHA runners occasionally hit I/O constraints while running the root-test
tests. When concurrent tests rapidly allocate loopback devices,
background udev probing stalls. This quickly exhausts systemd-udevd's
default worker pool ceiling (20 children max), stalling netlink uevent
processing so device-mapper device nodes are never created for the
subsequent dm-verity tests.
Logging the cgroup v2 pids.peak value confirmed that in-flight udev
workers peak at 325 during test execution. Raising the children-max
limit to 500 leaves comfortable headroom, so udevd can spawn worker
processes freely without locking up event processing or causing test
timeouts.
Assisted-by: Antigravity
Signed-off-by: Chris Henzie <chrishenzie@gmail.com>
During restore, filter out any annotation on the checkpointed container
whose key starts with `cdi.k8s.io/` or is exactly `cdi.k8s.io`, to
prevent unauthorized device restoration. A warning is logged for each
annotation that is denied.
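The matching rule (exact key or `/`-terminated prefix) can be sketched like this. `isCDIAnnotation` and `filterAnnotations` are hypothetical names and the annotation keys are illustrative; this is not the code from this change.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// isCDIAnnotation reports whether key is exactly "cdi.k8s.io" or lives under
// the "cdi.k8s.io/" prefix. Keys such as "cdi.k8s.io.example/x" do not match.
func isCDIAnnotation(key string) bool {
	return key == "cdi.k8s.io" || strings.HasPrefix(key, "cdi.k8s.io/")
}

// filterAnnotations returns the annotations allowed through restore and the
// sorted keys that were denied, so the caller can log a warning for each.
func filterAnnotations(in map[string]string) (map[string]string, []string) {
	out := make(map[string]string, len(in))
	var denied []string
	for k, v := range in {
		if isCDIAnnotation(k) {
			denied = append(denied, k)
			continue
		}
		out[k] = v
	}
	sort.Strings(denied)
	return out, denied
}

func main() {
	allowed, denied := filterAnnotations(map[string]string{
		"cdi.k8s.io":            "vendor.example/dev=a",
		"cdi.k8s.io/gpu":        "vendor.example/dev=b",
		"cdi.k8s.io.example/x":  "kept",
		"example.com/unrelated": "kept",
	})
	for _, k := range denied {
		fmt.Printf("warning: dropping annotation %q from checkpoint\n", k)
	}
	fmt.Println(len(allowed), "annotations kept")
}
```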
Tested by:
* Unit tests for exact matching, prefix boundaries, and metadata merging
* Complete CRI integration and checkpoint restore suite
Assisted-by: Antigravity
Signed-off-by: Samuel Karp <samuelkarp@google.com>
The TestCRIImagePullTimeout test case "NoDataTransferred" flaked under
constrained networks because the test proxy mirror registry used a
blocking ReadAtLeast call to forward bytes to containerd.
This blocking wait (up to 4KB) meant the mirror registry server
completely stopped forwarding data during network slowness, triggering
containerd's aggressive 5-second progress timeout and canceling the
pull before it could reach its 3MB circuit-breaker limit.
This is resolved by changing the proxy's custom copy loop from
io.ReadAtLeast(src, buf, len(buf)) to standard src.Read(buf). This
streams network chunks to containerd immediately as they arrive,
preventing false timeout cancellations while maintaining correct
circuit-breaker byte tracking.
Assisted-by: Antigravity
Signed-off-by: Samuel Karp <samuelkarp@google.com>
The CRI progress reporter cancels an image pull if it sees no progress
for 5 seconds. It tracks this through active HTTP requests. During
remote fetches, the HTTP response reader is closed via a deferred
call after `content.Copy` completes.
Diagnosis:
`content.Copy` handles both downloading the stream and committing
the writer to the content store. Any delays during the database
commit phase (e.g. from database locks, slow disk syncs, or concurrent
pull deduplication blocks) keep the HTTP connection open. The progress
reporter sees the request is still active (`activeReqs = 1`) but no new
bytes are coming in, leading to a premature timeout cancellation.
Reproduction:
We reproduced this flakiness deterministically on a GCE VM under a
simulated 2 Mbps ingress bandwidth limit using Linux traffic control
ingress policing (`tc filter ... action police rate 2mbit`). Under this
slowness, the download took longer than the progress timeout during the
slow commit phase, triggering context cancellation and failing the
`TestCRIImagePullTimeout/HoldingContentOpenWriterWithLocalPull` test.
Solution:
To fix this, we wrap the HTTP reader in a `closeOnEOFReader` or
`closeOnEOFReadSeeker` before handing it to `content.Copy`. If the
underlying connection reader implements `io.Seeker`, it is dynamically
wrapped in `closeOnEOFReadSeeker` to forward `Seek` operations. This
ensures that O(1) Range seeks are fully preserved during network
resumes or retries. The wrappers automatically close the underlying
network stream as soon as `Read()` returns `io.EOF` (when the download
completes, before the database commit begins). This drops `activeReqs`
to `0` early, freeing the socket and preventing progress timeouts
during commits. A `sync.Once` ensures that subsequent deferred
`Close()` calls do not double-decrement the reporter.
How it was tested:
Verified the fix on a GCE VM under a simulated 2 Mbps ingress
bandwidth limit. Verified seeker safety via standalone logic audits
and trace proofs.
Assisted-by: Antigravity
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Update the setup-go step in our private action yml to:
1) pin the version by hash (with a comment giving the version string)
2) remove the cache-disable workaround for an issue that was fixed 3 years ago
Signed-off-by: Phil Estes <estesp@amazon.com>
Update the Fuzzing workflow to upload crash artifacts found during the
go_test_fuzz job. Currently, when `go test -fuzz` fails, the crash
reproducers are generated but not preserved, making it difficult to
diagnose and fix the issues discovered in CI.
This change adds an upload-artifact step that captures all files in
testdata/fuzz directories across the repository upon failure.
Assisted-by: gemini-cli
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Signed-off-by: lauralorenz <lauralorenz@google.com>
Between starting the sandbox and adding it to the
sandbox store, there are opportunities for failure,
including in any NRI RunPodSandbox prehooks. Add a defer
covering that window so that if one of them fails, this
function tries to clean up the sandbox itself. Once the
sandbox has been added to the persistent store, the defer
does not attempt to stop it, since other components can
now recognize it from the CRI store. ShutdownSandbox is
used instead of StopSandbox because it both stops the
sandbox and cleans up all its directories.
Signed-off-by: lauralorenz <lauralorenz@google.com>
The task service guards its containers map with s.mu, and getContainer()
takes it on behalf of effectively every task RPC (State, Connect, Stats,
Wait, Pause, Kill, ...). Create() held s.mu for its whole duration,
including runc.NewContainer(), which runs the actual `runc create`.
`runc create` can be slow on a loaded host. While it runs, any concurrent
task RPC blocks on s.mu. The tasks service applies a 2s timeout to State
(io.containerd.timeout.task.state), so a concurrent State waits on s.mu,
exceeds the deadline, and the ttrpc call is abandoned -- the late shim
reply then shows up as:
ttrpc: received message on inactive stream stream=3
Since deadline errors are now surfaced to clients, this is treated as a
fatal failure and the just-created container is torn down right after
start (observed on Lima/vz: nginx -> Exited (1)).
Move runc.NewContainer() out of the s.mu critical section, mirroring the
runtime v1 shim lock optimization. s.mu is taken only once the container
exists, to guard the map and the remaining (fast) setup, so a slow create
no longer blocks concurrent State and other lookups.
preStart/handleStarted/cleanup only use s.lifecycleMu, so early-exit
handling is unchanged.
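The lock narrowing can be illustrated with a toy service. All names here (`service`, `create`, `getContainer`, the `slow` callback) are hypothetical stand-ins for the shim's task service and `runc.NewContainer()`; the sketch omits duplicate-ID checks and the lifecycle handling mentioned above.

```go
package main

import (
	"fmt"
	"sync"
)

type container struct{ id string }

// service is a toy stand-in for the shim's task service.
type service struct {
	mu         sync.Mutex
	containers map[string]*container
}

func newService() *service {
	return &service{containers: make(map[string]*container)}
}

// getContainer takes s.mu briefly, as every lookup-style RPC would.
func (s *service) getContainer(id string) (*container, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	c, ok := s.containers[id]
	return c, ok
}

// create runs the slow step before taking s.mu, then holds the lock only
// for the map update.
func (s *service) create(id string, slow func() (*container, error)) error {
	c, err := slow()
	if err != nil {
		return err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.containers[id] = c
	return nil
}

func main() {
	s := newService()
	s.containers["a"] = &container{id: "a"}

	started, release := make(chan struct{}), make(chan struct{})
	done := make(chan error, 1)
	go func() {
		done <- s.create("b", func() (*container, error) {
			close(started)
			<-release // simulate a slow create on a loaded host
			return &container{id: "b"}, nil
		})
	}()

	<-started
	// The slow step is still running, yet the lookup does not block,
	// because create is not holding s.mu.
	_, ok := s.getContainer("a")
	fmt.Println("lookup during create:", ok)

	close(release)
	fmt.Println("create error:", <-done)
}
```

If `create` took `s.mu` before calling `slow`, the lookup in `main` would block until `release` was closed, which is the shape of the timeout described above.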
See lima-vm/lima#5030.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>