docker-stacks
Kernel crash when using tensorflow/pytorch notebook image
What docker image(s) are you using?
pytorch-notebook, tensorflow-notebook
Host OS system
Ubuntu 23.10
Host architecture
x86_64
What Docker command are you running?
docker run -it --rm -p 8888:8888 quay.io/jupyter/tensorflow-notebook:tensorflow-2.16.1
docker run -it --rm -p 8888:8888 quay.io/jupyter/pytorch-notebook:pytorch-2.2.2
How to Reproduce the problem?
It is hard to give a full minimal working example because the bug happens when training a specific DL model on CPU via Keras, and that model is not easy to reduce. It only happens when running my code via the jupyter/tensorflow-notebook and jupyter/pytorch-notebook images (not when I run the same code directly on my system).
I have an easy workaround (defining the Keras loss via a function instead of a class instance, as sketched below), but I thought you would be interested to know about this weird behavior.
See this Keras issue for more context: https://github.com/keras-team/keras/issues/19601
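For illustration, here is roughly what the two variants look like (a simplified placeholder model and loss, not the actual code that triggers the crash):

```python
import numpy as np
import keras


# Variant that crashes the kernel in my setup: the loss is passed as a Loss subclass instance.
class MyLoss(keras.losses.Loss):
    def call(self, y_true, y_pred):
        return keras.ops.mean(keras.ops.square(y_true - y_pred), axis=-1)


# Workaround: the same loss defined as a plain function.
def my_loss(y_true, y_pred):
    return keras.ops.mean(keras.ops.square(y_true - y_pred), axis=-1)


# Placeholder model and data; the real model is more involved and trains on CPU.
model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(1)])
model.compile(optimizer="adam", loss=my_loss)  # loss=MyLoss() is the variant that crashes for me
model.fit(np.random.rand(64, 4), np.random.rand(64, 1), epochs=2, verbose=0)
```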
Command output
No response
Expected behavior
No response
Actual behavior
Kernel crashes
Anything else?
My code is run by a JupyterLab server (using the latest official Docker images jupyter/tensorflow-notebook and jupyter/pytorch-notebook from jupyter/docker-stacks), and I connect to it via the vscode-jupyter extension.
The crash is caused by the model.fit() call. It happens within a few seconds when using the torch backend, and a bit later with the tensorflow backend (after a few epochs), but there is no explicit error message I can share with you.
According to this link, the root cause could be a buggy installation of TensorFlow/PyTorch due to mixing pip and conda packages (the official Jupyter images install tensorflow via pip while the other packages are installed via mamba/conda).
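For reference, running the following in the container's terminal should show which of these packages were installed from pip (they report pypi in the channel column of mamba list):
mamba list | grep -iE "tensorflow|keras|torch"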
Latest Docker version
- [x] I've updated my Docker version to the latest available, and the issue persists
@mthiboust Are you using the latest versions of the images?
You can add --pull=always to the docker run command.
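For example, reusing the command from the issue description:
docker run -it --rm --pull=always -p 8888:8888 quay.io/jupyter/tensorflow-notebook:tensorflow-2.16.1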
Unfortunately, we were not able to install tensorflow using mamba last time we tried (it might have changed).
Could you please check, maybe it started working?
I would be happy to switch to mamba.
I wouldn't touch the CUDA version though; I don't think it's going to work with mamba.
I am using the tensorflow-notebook:tensorflow-2.16.1 and pytorch-notebook:pytorch-2.2.2 images from 2 days ago (initial post edited).
For your info, this bug is not present when using the older jupyter/tensorflow-notebook:tensorflow-2.14.0 image. I will let you know if I find other insights to better understand this weird behavior.
Is it an option to not use conda/mamba in the Jupyter images? I may try it on my side to verify whether it corrects the issue.
@mthiboust Then I suggest using my/b-data's CUDA-enabled JupyterLab Python docker stack.
What makes this project different:
- Multi-arch: linux/amd64, linux/arm64/v8
- Derived from nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
  - including development libraries and headers
- TensorRT and TensorRT plugin libraries
  - including development libraries and headers
- IDE: code-server next to JupyterLab
- Just Python – no Conda / Mamba
Python 3.12 images will be updated to CUDA 12.4.0 today and be compatible with PyTorch ≥ 2.2 and TensorFlow ≥ 2.16[.1].
Python 3.11 images will remain with CUDA 11.8.0 and be compatible with PyTorch ≥ 2.0 and TensorFlow 2.12 - 2.14.
Closing this issue as it appears that the cause is on the Keras side (cf. https://github.com/keras-team/keras/issues/19601).