
No container ID in DB when krunner fails to start

Open adrysn opened this issue 3 years ago • 0 comments

At a customer site, we found a bug where a specific user's sessions always get stuck in the PREPARING state for a specific group of images. The cause turned out to be a failure to start the kernel runner inside the container; in this case, the path to the Python binary was wrong. In addition, the container ID is not recorded in the DB even though the container itself has been created (only the kernel runner failed to start), so we cannot destroy the container because no container ID can be found in the DB.

I suggest the following fix:

Once a container is created, its container ID should be recorded in the DB regardless of the kernel runner's status. If the kernel runner fails to start, we can either mark the container's status as ERROR or destroy the container automatically.
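
The proposed ordering could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual agent code: FakeAgent, KernelRecord, create_kernel, the destroy_on_failure flag, and the in-memory dict standing in for the DB are all hypothetical names invented for this example. The point it shows is that the container ID is written before the kernel runner is started, so a runner failure still leaves an ID that can be used for cleanup.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class KernelRecord:
    # Hypothetical stand-in for a kernel row in the DB.
    status: str = "PREPARING"
    container_id: Optional[str] = None


@dataclass
class FakeAgent:
    # Hypothetical in-memory stand-ins for the DB and the container runtime.
    db: Dict[str, KernelRecord] = field(default_factory=dict)
    containers: Set[str] = field(default_factory=set)
    fail_runner: bool = False  # simulates a broken kernel runner

    async def create_container(self, kernel_id: str) -> str:
        container_id = f"container-{kernel_id}"
        self.containers.add(container_id)
        return container_id

    async def start_kernel_runner(self, container_id: str) -> None:
        if self.fail_runner:
            raise RuntimeError("kernel runner failed to start")

    async def destroy_container(self, container_id: str) -> None:
        self.containers.discard(container_id)

    async def create_kernel(
        self, kernel_id: str, destroy_on_failure: bool = False
    ) -> KernelRecord:
        record = self.db.setdefault(kernel_id, KernelRecord())
        container_id = await self.create_container(kernel_id)
        # Record the container ID right away, before touching the kernel runner.
        record.container_id = container_id
        try:
            await self.start_kernel_runner(container_id)
        except Exception:
            # The runner failed, but the ID is already stored, so the
            # container can be marked as ERROR and/or destroyed.
            record.status = "ERROR"
            if destroy_on_failure:
                await self.destroy_container(container_id)
            return record
        record.status = "RUNNING"
        return record
```

With fail_runner=True, create_kernel returns a record whose status is ERROR and whose container_id is still set, so the leftover container can be destroyed later, or immediately if destroy_on_failure is enabled.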

To reproduce this easily, add a raise statement in the _init_jupyter_kernel method in kernel/base.py (shown in the screenshot attached to the original issue).

adrysn avatar Dec 13 '21 02:12 adrysn