Force a 4K block size on all platforms rather than only on darwin.
An explicit caller-supplied -b is still respected.
Signed-off-by: Chris Crone <christopher.crone@docker.com>
The CRI checkpoint restore path unpacked checkpoint archive/OCI image content
directly into the container's persistent state directory and read files such as
container.log back from it with a symlink-following copy. Checkpoint content is
externally provided, so make restore more defensive about what it unpacks and
how it reads those files back.
Behavior changes:
- Only unpack regular files and directories from the checkpoint archive.
- Unpack checkpoint content into a dedicated <state>/ctrd-restore
  subdirectory created fresh rather than into the state dir itself, so
  checkpoint content cannot collide with containerd's own files (e.g.
  the "status" blob). Restore and cleanup operate on that subdir;
  cleanup is now a single RemoveAll of it.
Signed-off-by: Brian Goff <cpuguy83@gmail.com>
Image config labels are copied onto the container by both the CRI
plugin (BuildLabels) and the client's WithImageConfigLabels option
used by `ctr run`. Labels in the containerd.io/* namespace are
interpreted by containerd itself and labels in the io.cri-containerd*
namespace are interpreted by the CRI plugin. An image config is not a
trusted source for labels in either namespace.
Skip labels in both reserved namespaces when copying labels from an
image config to a container, and warn about each label skipped: an
image that tries to set them may be attempting to alter containerd
behavior. Oversized image labels are already skipped this way by
the CRI plugin.
Labels set explicitly by clients, for example via `ctr run --label`
or in the CRI request, are unaffected.
Verified with the CRI plugin and with `ctr run` against an image
whose config carries labels like these: the labels are no longer
present on the created container and a warning is logged for each.
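A rough sketch of the filtering described above, under stated assumptions:
filterImageLabels and reservedPrefixes are hypothetical names, and the two
prefixes are taken from the namespaces named in this message. The real code
logs a warning per skipped label; here the skipped keys are simply returned so
the caller can decide how to report them.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// reservedPrefixes lists the label namespaces an image config may not set.
// Assumption: prefix matching on these two strings approximates the
// "containerd.io/*" and "io.cri-containerd*" namespaces.
var reservedPrefixes = []string{"containerd.io/", "io.cri-containerd"}

func isReserved(key string) bool {
	for _, p := range reservedPrefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}

// filterImageLabels returns a copy of imageLabels without reserved keys,
// plus the sorted list of keys that were skipped.
func filterImageLabels(imageLabels map[string]string) (map[string]string, []string) {
	kept := make(map[string]string, len(imageLabels))
	var skipped []string
	for k, v := range imageLabels {
		if isReserved(k) {
			skipped = append(skipped, k)
			continue
		}
		kept[k] = v
	}
	sort.Strings(skipped)
	return kept, skipped
}

func main() {
	kept, skipped := filterImageLabels(map[string]string{
		"maintainer":               "someone",
		"containerd.io/example":    "x",
		"io.cri-containerd.kind":   "sandbox",
		"containerd.io.unrelated":  "y", // no slash after containerd.io, so not reserved
	})
	fmt.Println(len(kept), skipped)
}
```

Labels passed explicitly by a client would bypass this function entirely,
which matches the "unaffected" behavior described above.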
Assisted-by: Claude Code
Signed-off-by: Ben Cressey <ben@cressey.org>
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Filter out any annotation on the checkpointed container whose key starts
with `cdi.k8s.io/` or is exactly `cdi.k8s.io` during restore, to prevent
unauthorized device restoration. A warning is logged for each annotation
that is denied.
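The matching rule can be sketched as follows. isCDIAnnotation is a
hypothetical name, and the assumption is that matching applies to the
annotation key. The point of the example is the prefix boundary: a key such as
`cdi.k8s.iox` is not in the CDI namespace and should pass through.

```go
package main

import (
	"fmt"
	"strings"
)

// isCDIAnnotation reports whether key is exactly "cdi.k8s.io" or lives under
// the "cdi.k8s.io/" prefix. Hypothetical helper for illustration.
func isCDIAnnotation(key string) bool {
	return key == "cdi.k8s.io" || strings.HasPrefix(key, "cdi.k8s.io/")
}

func main() {
	for _, k := range []string{"cdi.k8s.io", "cdi.k8s.io/gpu", "cdi.k8s.iox", "example.com/cdi.k8s.io/"} {
		fmt.Printf("%q denied=%v\n", k, isCDIAnnotation(k))
	}
}
```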
Tested by:
* Unit tests for exact matching, prefix boundaries, and metadata merging
* Complete CRI integration and checkpoint restore suite
Assisted-by: Antigravity
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Between starting the sandbox and adding it to the
sandbox store there are several places where a failure
can occur, including any NRI RunPodSandbox prehooks. Add
a defer covering that window so that if one of them
fails, this function cleans up the sandbox itself. Once
the sandbox has been added to the persistent store, the
defer does not attempt to stop it, because other
components can now find it through the CRI store.
ShutdownSandbox is used instead of StopSandbox because it
both stops the sandbox and cleans up all its directories.
Signed-off-by: lauralorenz <lauralorenz@google.com>
RunPodSandbox unconditionally pre-pulls the pause container
image via ensurePauseImageExists() before starting any sandbox.
However, only the "podsandbox" controller actually uses the pause
image to create a pause container holding namespaces. Shim-based
sandbox controllers (e.g. Kata Containers) manage the sandbox
lifecycle entirely at the shim level and never reference the pause
image.
Add a DisablePauseImagePull flag to the Runtime config that gates
ensurePauseImageExists(). When a sandboxer is not "podsandbox", the
flag skips the unnecessary pre-pull, avoiding wasted network/storage
overhead and reducing sandbox startup latency.
The long-term direction is to offload image pulling entirely to the
controller implementation (shim level); this flag is an incremental
step toward that goal without introducing a breaking behavior change.
Also add unit tests to verify that ensurePauseImageExists is only
invoked for the "podsandbox" sandboxer and correctly skipped otherwise.
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
The pull progress reporter resets lastSeenTimestamp on every tick where
activeReqs == 0, but never on the transition to a non-zero count. When a
pull is held in content.OpenWriter (idle in HTTP terms) and then
unblocks, the next request can be cancelled less than `timeout` after it
was actually issued — its first byte must arrive within whatever fraction
of `timeout` remains on the timer captured during the previous idle
tick.
Track the previous tick's activeReqs and reset the timer on the 0→1
transition so a newly-issued request always gets a full timeout window
to produce its first byte. This deflakes
TestCRIImagePullTimeout/HoldingContentOpenWriterWithLocalPull, which
hits ghcr.io directly and can exceed the shrunken window during
auth handshakes in CI.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Add metrics for NRI plugin invocations, latency, adjustments, and active
plugin count. Map the NRI metrics adaptation layer onto containerd's
Prometheus metrics system via docker/go-metrics.
Categorize plugin invocation errors into `deadline_exceeded`,
`canceled`, and dynamic gRPC status code dimensions to assist
troubleshooting.
Assisted-by: Antigravity
Signed-off-by: Chris Henzie <chrishenzie@gmail.com>
Enhance MetadataPath() to check the suffix of layerBlobPath and
detect a path that already ends with .dmverity.
Also add a unit test asserting that MetadataPath("...dmverity")
returns the path unchanged, to lock in the new behavior.
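A minimal sketch of the suffix check, assuming that the function otherwise
derives the metadata path by appending ".dmverity" to the layer blob path.
That appending behavior is an assumption made for the example, and
metadataPath here is a local stand-in rather than the real function.

```go
package main

import (
	"fmt"
	"strings"
)

const dmverityExt = ".dmverity"

// metadataPath is a stand-in for MetadataPath. Assumption: the metadata file
// sits next to the layer blob with a ".dmverity" extension.
func metadataPath(layerBlobPath string) string {
	if strings.HasSuffix(layerBlobPath, dmverityExt) {
		// Already a metadata path: return it unchanged.
		return layerBlobPath
	}
	return layerBlobPath + dmverityExt
}

func main() {
	fmt.Println(metadataPath("layer.erofs"))
	fmt.Println(metadataPath("layer.erofs.dmverity"))
}
```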
Signed-off-by: Alex Lyn <alex.lyn@antgroup.com>
Switch the CRI integration layer from containerd's forked Kubernetes helpers
and clients to the upstream Kubernetes modules, and finalize the dependency
update to Kubernetes v0.36.0.
Replace the remaining internal helper copies with upstream packages:
- internal/cri/clock -> k8s.io/utils/clock
- internal/cri/executil -> upstream CRI exec helpers
- internal/cri/resourcequantity -> k8s.io/apimachinery/pkg/api/resource
- internal/cri/setutils -> k8s.io/apimachinery/pkg/util/sets
- internal/cri/types/labels.go -> internal/cri/labels
- integration/cri-api/pkg/apis/services.go -> k8s.io/cri-api/pkg/apis/services.go
Adopt the upstream CRI clients directly:
- add k8s.io/cri-client v0.36.0, k8s.io/cri-streaming v0.36.0, and
  k8s.io/streaming v0.36.0 as direct dependencies
- promote k8s.io/utils to a direct dependency and pull in
  k8s.io/component-base v0.36.0 indirectly
- keep integration/remote as a thin containerd adapter around cri-client,
  because the integration tests still need the stream-shaped
  GetContainerEvents RPC
Finalize the Kubernetes dependency update from v0.36.0-rc.0 to v0.36.0,
refresh vendor/, and drop the obsolete internal utility copies.
Also fix the protobuf MessageState mutex-copy vet failures exposed by the new
APIs and close the temporary integration CRI clients explicitly.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
mkfs.erofs uses TMPDIR for its internal diskbuf temp files. Windows
does not set TMPDIR (only TEMP/TMP), so the MinGW binary falls back
to "/tmp" which resolves to C:\tmp. That directory does not exist on
most Windows machines. mkstemp fails, and erofs_diskbuf_init returns
ENOSPC regardless of actual errno, producing a misleading "No space
left on device" error even on disks with plenty of free space.
Set TMPDIR to the snapshot directory (parent of the output layer file)
for all mkfs.erofs invocations on Windows. This directory is managed
by containerd and guaranteed to exist. On Unix, TMPDIR is left to the
parent process (no change in behavior).
Signed-off-by: Craig Chelnak <craig.chelnak@docker.com>
Three related bugs prevented Windows CPU affinity from round-tripping
through UpdateContainerResources and ContainerStatus:
1. WithWindowsResources silently dropped AffinityCpus, so the kubelet's
   CPU manager reconcile loop never applied affinity changes to running
   containers. Add translation from CRI AffinityCpus to OCI
   WindowsCPUGroupAffinity.
2. copyResourcesToStatus never read the Affinity field from the OCI spec,
   so the stored container status always had AffinityCpus = nil. Add the
   read-back loop.
3. deepCopyOf omitted AffinityCpus when snapshotting Windows resources,
   silently dropping the field on every Status.Get(). Add the deep copy.
Signed-off-by: zylxjtu <zhang.yuanliang@hotmail.com>
Kernel 6.12.80+ returns 'fsync=volatile' instead of just 'volatile'
in mount options, which breaks containerd's exact string matching
checks.
Fix this by accepting 'fsync=volatile' in addition to the existing
'volatile' check in RemoveVolatileOption and
addVolatileOptionOnImageVolumeMount.
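The widened match can be sketched on a plain option slice. isVolatileOption
and removeVolatileOptions are hypothetical helpers for illustration; the real
functions operate on containerd's mount types rather than bare strings.

```go
package main

import "fmt"

// isVolatileOption matches both spellings of the overlayfs volatile option.
func isVolatileOption(opt string) bool {
	return opt == "volatile" || opt == "fsync=volatile"
}

// removeVolatileOptions returns opts without any volatile option,
// leaving the input slice untouched.
func removeVolatileOptions(opts []string) []string {
	out := make([]string, 0, len(opts))
	for _, o := range opts {
		if !isVolatileOption(o) {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	opts := []string{"lowerdir=/l", "upperdir=/u", "workdir=/w", "fsync=volatile"}
	fmt.Println(removeVolatileOptions(opts))
}
```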
Assisted-by: Antigravity
Signed-off-by: Chris Henzie <chrishenzie@gmail.com>
Allow reading snapshot mounts without performing the mounts. This is
useful when the host cannot perform the mounts because of platform or
permission restrictions.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Mirror cAdvisor's instantaneous CPU rate behavior for CRI stats.
Compute UsageNanoCores from the latest two samples only, and leave the
field unset when there is not yet enough data to calculate an
instantaneous rate. This avoids publishing an authoritative zero before
a valid rate exists while keeping containerd aligned with cAdvisor
semantics.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
If the ticker interval is shorter than the time a sync takes, CPU usage
spikes. Use an adaptive interval and sleep so that the CPU is released.
Signed-off-by: jokemanfire <hu.dingyang@zte.com.cn>