backend.ai
No container ID in DB when krunner fails to start
At a customer site, we found a bug where a specific user's sessions always get stuck in the PREPARING state for a specific group of images. The cause turned out to be a failure to start the kernel runner inside the container; in this case, the path to the Python binary was wrong. In addition, the container ID is not recorded in the DB even though the container itself is created (only the kernel runner fails to start), so we cannot destroy the container because no container ID can be found in the DB.
I suggest fixing this issue as follows:
Once a container is created, its container ID should be recorded in the DB regardless of the kernel runner's status. If the kernel runner fails to start, we can mark the container's status as ERROR or just destroy it automatically.
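A minimal sketch of the proposed ordering, not the actual agent or manager code. Every name here (`KernelRow`, `FakeRuntime`, `start_kernel`, `destroy_on_failure`) is hypothetical and only illustrates the idea: persist the container ID right after creation, then handle a kernel runner failure by either marking the record as ERROR or destroying the container.

```python
import asyncio
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class KernelStatus(Enum):
    PREPARING = "PREPARING"
    RUNNING = "RUNNING"
    ERROR = "ERROR"
    TERMINATED = "TERMINATED"


@dataclass
class KernelRow:
    """Hypothetical stand-in for the kernel record stored in the DB."""
    status: KernelStatus = KernelStatus.PREPARING
    container_id: Optional[str] = None


class FakeRuntime:
    """Hypothetical container runtime whose kernel runner always fails."""

    def __init__(self) -> None:
        self.containers: set = set()
        self._next_id = 0

    async def create_container(self) -> str:
        self._next_id += 1
        container_id = f"container-{self._next_id}"
        self.containers.add(container_id)
        return container_id

    async def start_kernel_runner(self, container_id: str) -> None:
        # Simulates the reported failure (e.g. a wrong Python binary path).
        raise RuntimeError("kernel runner failed to start")

    async def destroy_container(self, container_id: str) -> None:
        self.containers.discard(container_id)


async def start_kernel(
    row: KernelRow,
    runtime: FakeRuntime,
    destroy_on_failure: bool = False,
) -> KernelRow:
    container_id = await runtime.create_container()
    # Record the ID before touching the kernel runner, so the container
    # can always be found and destroyed later.
    row.container_id = container_id
    try:
        await runtime.start_kernel_runner(container_id)
    except Exception:
        if destroy_on_failure:
            # Option 2: clean up the container automatically.
            await runtime.destroy_container(container_id)
            row.status = KernelStatus.TERMINATED
        else:
            # Option 1: keep the container and mark it as failed.
            row.status = KernelStatus.ERROR
        return row
    row.status = KernelStatus.RUNNING
    return row
```

With this ordering, `asyncio.run(start_kernel(KernelRow(), FakeRuntime()))` returns a row that has a container ID and the ERROR status instead of staying in PREPARING with no ID.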
To reproduce this easily, just add `raise` in the `_init_jupyter_kernel` method in `kernel/base.py` (image below).