The task service guards its containers map with s.mu, and getContainer()
takes it on behalf of effectively every task RPC (State, Connect, Stats,
Wait, Pause, Kill, ...). Create() held s.mu for its whole duration,
including runc.NewContainer(), which runs the actual `runc create`.

`runc create` can be slow on a loaded host. While it runs, any concurrent
task RPC blocks on s.mu. The tasks service applies a 2s timeout to State
(io.containerd.timeout.task.state), so a concurrent State waits on s.mu,
exceeds the deadline, and the ttrpc call is abandoned -- the late shim
reply then shows up as:

    ttrpc: received message on inactive stream stream=3

Since deadline errors are now surfaced to clients, this is treated as a
fatal failure and the just-created container is torn down right after
start (observed on Lima/vz: nginx -> Exited (1)).

Move runc.NewContainer() out of the s.mu critical section, mirroring the
runtime v1 shim lock optimization. s.mu is taken only once the container
exists, to guard the map and the remaining (fast) setup, so a slow create
no longer blocks concurrent State and other lookups.

preStart/handleStarted/cleanup only use s.lifecycleMu, so early-exit
handling is unchanged.
See lima-vm/lima#5030.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>