containerd/integration/sandbox_run_linux_test.go
Wei Fu 2042e805b8 cri/server/podsandbox: disable event subscriber
Each sandbox container already has its own goroutine. If the handler
hits an error, that goroutine puts the event into the backoff queue, so
podsandbox does not need an event subscriber. With the subscriber
enabled, two goroutines would try to clean up the same sandbox container.

```
>>>> From EventMonitor
  time="2025-10-23T19:30:59.626254404Z" level=debug msg="Received containerd event timestamp - 2025-10-23 19:30:59.624494674 +0000 UTC, namespace - \"k8s.io\", topic - \"/tasks/exit\""
  time="2025-10-23T19:30:59.626301912Z" level=debug msg="TaskExit event in podsandbox handler container_id:\"22e15114133e4d461ab380654fb76f3e73d3e0323989c422fa17882762979ccf\" id:\"22e15114133e4d461ab380654fb76f3e73d3e0323989c422fa17882762979ccf\" pid:203121 exit_status:137 exited_at:{seconds:1761247859 nanos:624467824}"

>>>> If EventMonitor handles the task exit successfully, it closes the ttrpc
connection, and waitSandboxExit can then hit a ttrpc-closed error

  time="2025-10-23T19:30:59.688031150Z" level=error msg="failed to delete task" error="ttrpc: closed" id=22e15114133e4d461ab380654fb76f3e73d3e0323989c422fa17882762979ccf
```

If both task.Delete calls fail but the shim has already been shut down, it
could trigger a new task.Exit event sent by cleanupAfterDeadShim. This would
result in three events in the EventMonitor's backoff queue, which is unnecessary
and could cause confusion due to duplicate events.

The worst-case scenario caused by two concurrent task.Delete calls is a shim
leak. The timeline for this scenario is as follows:

| Timestamp | Component       | Action                        | Result                                                                                           |
| ------    | -----------     | --------                      | --------                                                                                         |
| T1        | EventMonitor    | Sends `task.Delete`           | Marked as Req-1                                                                                  |
| T2        | waitSandboxExit | Sends `task.Delete`           | Marked as Req-2                                                                                  |
| T3        | containerd-shim | Handles Req-2                 | Container transitions from stopped to deleted                                                    |
| T4        | containerd-shim | Handles Req-1                 | Fails - container already deleted<br>Returns error: `cannot delete a deleted process: not found` |
| T5        | EventMonitor    | Receives `not found` error    | -                                                                                                |
| T6        | EventMonitor    | Sends `shim.Shutdown` request | No-op (active container record still exists)                                                     |
| T7        | EventMonitor    | Closes ttrpc connection       | Clean container state dir                                                                        |
| T8        | containerd-shim | Handles Req-2                 | Removes container record from memory                                                             |
| T9        | waitSandboxExit | Receives error                | Error: `ttrpc: closed`                                                                           |
| T10       | waitSandboxExit | Sends `shim.Shutdown` request | Fails (connection already closed)                                                                |
| T11       | waitSandboxExit | Closes ttrpc connection       | No-op (already closed)                                                                           |

The containerd-shim keeps running because shim.Shutdown was sent at T6,
before the container record was removed at T8. Because the container's state
dir is deleted at T7, containerd cannot clean up the shim after it restarts.
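
The timeline above can be replayed as a deterministic toy model. This is only
a sketch: `toyShim` and its methods are hypothetical names that mirror the
table rows, not containerd or shim code, and the rule that Shutdown is a no-op
while a container record exists is taken from the T6 row.

```go
package main

import (
	"errors"
	"fmt"
)

// toyShim is a hypothetical model of the shim state in the timeline.
type toyShim struct {
	records  map[string]string // container ID -> "stopped" or "deleted"
	running  bool              // whether the shim process is alive
	stateDir bool              // whether the on-disk state dir still exists
}

// beginDelete models the first half of a Delete request (T3 and T4).
func (s *toyShim) beginDelete(id string) error {
	if s.records[id] == "deleted" {
		return errors.New("cannot delete a deleted process: not found")
	}
	s.records[id] = "deleted"
	return nil
}

// finishDelete models the second half of the winning request (T8).
func (s *toyShim) finishDelete(id string) { delete(s.records, id) }

// shutdown exits the shim only when no container record is left (T6).
func (s *toyShim) shutdown() {
	if len(s.records) == 0 {
		s.running = false
	}
}

// replayTimeline runs the table rows in order and reports whether the
// shim is still running with its state dir gone, i.e. leaked.
func replayTimeline() bool {
	const id = "sandbox"
	s := &toyShim{
		records:  map[string]string{id: "stopped"},
		running:  true,
		stateDir: true,
	}

	_ = s.beginDelete(id)                       // T3: Req-2 marks the container deleted
	if err := s.beginDelete(id); err != nil {   // T4/T5: Req-1 fails with not found
		s.shutdown()       // T6: no-op, the record still exists
		s.stateDir = false // T7: connection closed, state dir cleaned
	}
	s.finishDelete(id) // T8: Req-2 removes the record
	// T9-T11: waitSandboxExit sees "ttrpc: closed", so its Shutdown
	// request never reaches the shim.
	return s.running && !s.stateDir
}

func main() {
	fmt.Println("shim leaked:", replayTimeline())
}
```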

We should avoid concurrent task.Delete calls here.

I also added a `shutdown` subcommand to `ctr shim` for debugging.

Fixed: #12344

Signed-off-by: Wei Fu <fuweid89@gmail.com>
2025-10-24 12:19:52 -04:00


/*
   Copyright The containerd Authors.

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
*/

package integration

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// TestPodSandboxController_ShouldBackoffExitEventWhenFail checks that a
// sandbox can still be stopped and removed when the shim's Delete call is
// made to fail once through an injected failpoint.
func TestPodSandboxController_ShouldBackoffExitEventWhenFail(t *testing.T) {
	t.Logf("Inject Shim failpoint")

	sbConfig := PodSandboxConfig(t.Name(), "failpoint")
	injectShimFailpoint(t, sbConfig, map[string]string{
		// Fail the first Delete call with the error "retry".
		"Delete": "1*error(retry)",
	})

	t.Log("Create a sandbox")
	sbID, err := runtimeService.RunPodSandbox(sbConfig, failpointRuntimeHandler)
	require.NoError(t, err)

	t.Log("Stop the sandbox")
	err = runtimeService.StopPodSandbox(sbID)
	require.NoError(t, err)

	t.Log("Delete sandbox")
	err = runtimeService.RemovePodSandbox(sbID)
	require.NoError(t, err)
}