Extend the garbage-collection framework so a collectible resource can emit
forward references during graph traversal, in addition to the existing
back-reference mechanism.
A CollectionContext may now implement the optional collectionWithReferences
interface:
References(ctx context.Context, node gc.Node, fn func(gc.Node))
When the GC visits a node whose resource type was registered by an external
collector, gcContext.references consults the per-type References
implementation after the built-in core resource types are handled.
This is the forward-reference analogue of collectionWithBackRefs. Whereas
ActiveWithBackRefs must enumerate every edge up front and the gcContext
holds all of them in its backRefs map for the entire collection, References
is invoked on demand for a single node. A collector whose resources fan
out to many other nodes can therefore emit those edges without retaining
them in memory for the gc context.
This commit is intentionally a no-op: no plugin registers a collector that
uses collectionWithReferences yet. It is isolated here so that concurrent
development efforts that depend on this interface can be proposed and
reviewed upstream independently.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Allow the last host to retry on transient network errors to increase the
likelihood of the operation succeeding and help reduce flaky tests.
Signed-off-by: Derek McGowan <derek@mcg.dev>
TestImagesCreateUpdateDelete asserts that an image's UpdatedAt is
strictly after its CreatedAt. Both timestamps are stamped via
time.Now().UTC(), which strips the monotonic reading, so the comparison
falls back to the wall clock. On platforms with coarse timer resolution
(e.g. Windows, which advances system time at a ~15.6ms tick), the
Create and Update calls can land in the same tick and produce identical
timestamps, making the strict After() check fail intermittently.
Wait for the wall clock to advance past the creation timestamp before
updating so the assertion stays meaningful without depending on clock
resolution. On fine-resolution clocks the loop runs zero iterations.
Signed-off-by: Austin Vazquez <austin.vazquez@docker.com>
Some proxy stream setup and receive paths still returned raw RPC
status errors while neighboring proxy methods normalized them with
errgrpc.ToNative. This made errdefs checks depend on which proxy API
surfaced the same remote failure.
Normalize event subscription setup and receive errors, and streaming
stream creation errors, while preserving io.EOF for completed receive
streams.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
Most content proxy operations normalize remote RPC errors before
returning them, including stream receive errors from Walk and write
errors from the remote writer. remoteReaderAt.ReadAt was an outlier and
returned raw status errors from Read and Recv.
Callers that use content.ReadBlob through the proxy can then fail
errdefs checks, such as treating concurrent content deletion as
NotFound.
Convert non-EOF read stream errors with errgrpc.ToNative so ReaderAt
matches the rest of the content proxy while preserving io.EOF.
Signed-off-by: Paweł Gronowski <pawel.gronowski@docker.com>
The CRI progress reporter cancels an image pull if it sees no progress
for 5 seconds. It tracks this through active HTTP requests. During
remote fetches, the HTTP response reader is closed via a deferred
call after `content.Copy` completes.
Diagnosis:
`content.Copy` handles both downloading the stream and committing
the writer to the content store. Any delays during the database
commit phase (e.g. from database locks, slow disk syncs, or concurrent
pull deduplication blocks) keep the HTTP connection open. The progress
reporter sees the request is still active (`activeReqs = 1`) but no new
bytes are coming in, leading to a premature timeout cancellation.
Reproduction:
We reproduced this flakiness deterministically on a GCE VM under a
simulated 2 Mbps ingress bandwidth limit using Linux traffic control
ingress policing (`tc filter ... action police rate 2mbit`). Under this
slowness, the download took longer than the progress timeout during the
slow commit phase, triggering context cancellation and failing the
`TestCRIImagePullTimeout/HoldingContentOpenWriterWithLocalPull` test.
Solution:
To fix this, we wrap the HTTP reader in a `closeOnEOFReader` or
`closeOnEOFReadSeeker` before handing it to `content.Copy`. If the
underlying connection reader implements `io.Seeker`, it is dynamically
wrapped in `closeOnEOFReadSeeker` to forward `Seek` operations. This
ensures that O(1) Range seeks are fully preserved during network
resumes or retries. The wrappers automatically close the underlying
network stream as soon as `Read()` returns `io.EOF` (when the download
completes, before the database commit begins). This drops `activeReqs`
to `0` early, freeing the socket and preventing progress timeouts
during commits. A `sync.Once` ensures that subsequent deferred
`Close()` calls do not double-decrement the reporter.
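
The close-at-EOF idea can be sketched as follows. This is an illustration
under assumptions, not the code from the change: the field names and the
`countingCloser` helper are hypothetical, and the `Seek`-forwarding variant
is left out.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"sync"
)

// countingCloser is a hypothetical stand-in for the HTTP response body; it
// counts Close calls the way the reporter would count a request ending.
type countingCloser struct {
	io.Reader
	closes int
}

func (c *countingCloser) Close() error { c.closes++; return nil }

func newCountingCloser(s string) *countingCloser {
	return &countingCloser{Reader: strings.NewReader(s)}
}

// closeOnEOFReader closes the wrapped reader as soon as Read reports io.EOF.
type closeOnEOFReader struct {
	rc       io.ReadCloser
	once     sync.Once
	closeErr error
}

func (r *closeOnEOFReader) Read(p []byte) (int, error) {
	n, err := r.rc.Read(p)
	if err == io.EOF {
		_ = r.Close() // release the stream before any slow commit starts
	}
	return n, err
}

// Close is idempotent: sync.Once guarantees the underlying Close runs once,
// so a later deferred Close does not decrement twice.
func (r *closeOnEOFReader) Close() error {
	r.once.Do(func() { r.closeErr = r.rc.Close() })
	return r.closeErr
}

func main() {
	src := newCountingCloser("layer bytes")
	r := &closeOnEOFReader{rc: src}
	defer r.Close() // no-op here: EOF already closed the stream
	_, _ = io.Copy(io.Discard, r)
	fmt.Println(src.closes) // prints "1"
}
```
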
How it was tested:
Verified the fix on a GCE VM under a simulated 2 Mbps ingress
bandwidth limit. Verified seeker safety via standalone logic audits
and trace proofs.
Assisted-by: Antigravity
Signed-off-by: Samuel Karp <samuelkarp@google.com>
Read short-circuited on `if dpc.c == nil` before calling
`dpc.wg.Wait()`, which races with the dialer goroutine spawned in
openShimLog. The dialer assigns `dpc.c = c` (and may set `dpc.conerr`)
outside any lock; the only synchronization is the WaitGroup, and Read
skipped it on the fast path.
Signed-off-by: Austin Vazquez <austin.vazquez@docker.com>
Setting the GC labels ensures that extra references may get garbage
collected when the original image using them is removed.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Switch the CRI integration layer from containerd's forked Kubernetes helpers
and clients to the upstream Kubernetes modules, and finalize the dependency
update to Kubernetes v0.36.0.
Replace the remaining internal helper copies with upstream packages:
- internal/cri/clock -> k8s.io/utils/clock
- internal/cri/executil -> upstream CRI exec helpers
- internal/cri/resourcequantity -> k8s.io/apimachinery/pkg/api/resource
- internal/cri/setutils -> k8s.io/apimachinery/pkg/util/sets
- internal/cri/types/labels.go -> internal/cri/labels
- integration/cri-api/pkg/apis/services.go -> k8s.io/cri-api/pkg/apis/services.go
Adopt the upstream CRI clients directly:
- add k8s.io/cri-client v0.36.0, k8s.io/cri-streaming v0.36.0, and
  k8s.io/streaming v0.36.0 as direct dependencies
- promote k8s.io/utils to a direct dependency and pull in
  k8s.io/component-base v0.36.0 indirectly
- keep integration/remote as a thin containerd adapter around cri-client,
  because the integration tests still need the stream-shaped
  GetContainerEvents RPC
Finalize the Kubernetes dependency update from v0.36.0-rc.0 to v0.36.0,
refresh vendor/, and drop the obsolete internal utility copies.
Also fix the protobuf MessageState mutex-copy vet failures exposed by the new
APIs and close the temporary integration CRI clients explicitly.
Signed-off-by: Davanum Srinivas <davanum@gmail.com>
Extra references may have the prefix flag with the digest added.
Currently they are not being processed. The option that sets the
digest ref today sets both the prefix and add-digest flags.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Correctly handle cases where the mount activation still exists:
- If the activation is fully active, just return already exists and
  allow the caller to return an error or call Info to continue.
- If the activation is stale or incomplete due to a crash during
  activation, overwrite the identifier and clean up the incomplete
  activation during activate.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Kernel 6.12.80+ returns 'fsync=volatile' instead of just 'volatile'
in mount options, which breaks containerd's exact string matching
checks.
Fixes this issue by adding support for 'fsync=volatile' in addition
to the existing 'volatile' check in RemoveVolatileOption and
addVolatileOptionOnImageVolumeMount.
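
The matching rule can be sketched as below. The helpers `isVolatileOption`
and `removeVolatileOptions` are hypothetical names for illustration and
operate on a plain option slice; they are not the functions named above.

```go
package main

import "fmt"

// isVolatileOption reports whether a mount option is the volatile flag in
// either spelling: the plain form or the 'fsync=volatile' form.
func isVolatileOption(opt string) bool {
	return opt == "volatile" || opt == "fsync=volatile"
}

// removeVolatileOptions returns a copy of opts without any volatile flag,
// preserving the order of the remaining options.
func removeVolatileOptions(opts []string) []string {
	out := make([]string, 0, len(opts))
	for _, o := range opts {
		if !isVolatileOption(o) {
			out = append(out, o)
		}
	}
	return out
}

func main() {
	opts := []string{"lowerdir=/l", "fsync=volatile", "upperdir=/u", "volatile"}
	fmt.Println(removeVolatileOptions(opts)) // prints "[lowerdir=/l upperdir=/u]"
}
```
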
Assisted-by: Antigravity
Signed-off-by: Chris Henzie <chrishenzie@gmail.com>
On export, if the image is by-digest without any tag,
set the org.opencontainers.image.ref.name as the full name.
This prevents setting this field with a leading non-alphanumeric
character, which is incorrect OCI grammar. Fixes #10681.
Signed-off-by: Laura Lorenz <lauralorenz@google.com>
Allow the socket directory to be directly configured by the shim manager
with reasonable defaults when not set. The default for root users will
still be the same directory under the default state directory. For
non-root users a temp directory will be used as default if the state
directory is not owned by the user.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Send the socket directory from containerd to the shim. The shim still
decides where the socket goes but can use the environment variable
passed from containerd to ensure the socket is placed in the configured
directory with proper permission.
This is needed for some rootless cases which lack permission to access
the default state directory as currently set. Because the directory is
hardcoded by the shim, it is currently not possible to change where the
shim listens.
Signed-off-by: Derek McGowan <derek@mcg.dev>
Mark converted EROFS manifests with the erofs OS feature and cover
feature-aware manifest selection and unpack routing for erofs images.
Signed-off-by: ChengyuZhu6 <hudson@cyzhu.com>
Conditional gc references allow establishing a conditional reference,
which can be used to expire specific connections without needing
to update multiple objects.
For example, content can hold a temporary relationship to a snapshot
that can expire if the snapshot is unused after a specific time. This
allows updating just the snapshot label when it is used, without
needing to update other objects or create an expiring lease to hold the
connection.
Signed-off-by: Derek McGowan <derek@mcg.dev>
- Use time.NewTimer + Stop() instead of time.After to avoid timer leaks
- Treat context.DeadlineExceeded as retryable (pipe busy, not just missing)
- Wrap last dial error instead of os.ErrNotExist for better diagnostics
- Update makeConnection godoc to reflect current BootstrapResult type
Signed-off-by: Esteban Ginez <esteban.ginez@docker.com>
The shim "start" helper returns the named pipe address before the
daemon process has created the pipe via winio.ListenPipe(). On busy
Windows systems, containerd may try to connect before the pipe exists.
Add awaitPipeReady() — the start helper now polls the pipe address
(up to 5s, 10ms intervals) before writing the bootstrap result to
stdout. This follows hcsshim's readiness pattern where the shim
verifies its endpoint is ready before signaling the parent.
As a safety net, also parameterize makeConnection() with a dialer so
binary.Start() uses AnonDialer (retry) for new shims while loadShim()
keeps AnonReconnectDialer (fail-fast) for reconnects per #3659.
On Unix, awaitPipeReady() is a no-op: domain sockets appear atomically.
Signed-off-by: Esteban Ginez <esteban.ginez@docker.com>
The default walking applier performs a real temporary mount for
unpacking, but the mount manager failed to adapt to the walking
differ.
This fixes using the EROFS snapshotter together with the default walking
differ; without the fix it reports:
```
ctr: apply layer error for "[]": failed to extract layer sha256:[]:
failed to mount /var/lib/containerd/tmpmounts/containerd-mount3992073457:
internal mount option "X-containerd.mkfs.fs=ext4" was not consumed by
the mount manager
```
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>