CUDA error when running predictions without GPU

Open LouisK92 opened this issue 5 months ago • 6 comments

I am running segger in the danielunyi42/segger_dev:cuda121 Docker container on a system without a GPU.

Training worked after setting

trainer = Trainer(
    accelerator="cpu",
    # ... (remaining arguments omitted)
)

I then attempted predictions with the trained model using

model_version = 0
model_path = MODELS_DIR / "lightning_logs" / f"version_{model_version}"
model = load_model(model_path / "checkpoints")

receptive_field = {'k_bd': 4, 'dist_bd': 12, 'k_tx': 15, 'dist_tx': 3}

segment(
    model,
    dm,
    save_dir=TMP_DIR,
    seg_tag='segger_output',
    transcript_file=TRANSCRIPTS_PARQUET,
    receptive_field=receptive_field,
    min_transcripts=5,
    cell_id_col='segger_cell_id',
    use_cc=False,
    knn_method='kd_tree',
    verbose=True,
)

With this I run into the following error:

Processing Train batches:   0%|                                                                       | 0/1258 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/segger_dev/src/segger/prediction/predict_parquet.py", line 524, in segment
    predict_batch(
  File "/workspace/segger_dev/src/segger/prediction/predict_parquet.py", line 322, in predict_batch
    with cp.cuda.Device(gpu_id):
  File "cupy/cuda/device.pyx", line 173, in cupy.cuda.device.Device.__enter__
  File "cupy_backends/cuda/api/runtime.pyx", line 202, in cupy_backends.cuda.api.runtime.getDevice
  File "cupy_backends/cuda/api/runtime.pyx", line 146, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInsufficientDriver: CUDA driver version is insufficient for CUDA runtime version

I believe the issue could be resolved if there were a way to tell the function to use the CPU. Is this possible?
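The traceback shows the failure happening when `cp.cuda.Device(gpu_id)` is entered on a machine with no usable CUDA driver. As a rough sketch of how such a call could be guarded, the snippet below probes CuPy first and falls back to a no-op context manager. `cuda_available` and `device_context` are hypothetical helpers, not part of segger, and this only avoids the crash at the device context; any code inside it that builds CuPy arrays would still need a CPU path.

```python
from contextlib import nullcontext


def cuda_available() -> bool:
    """Return True only if CuPy imports and reports at least one device."""
    try:
        import cupy as cp
        return cp.cuda.runtime.getDeviceCount() > 0
    except Exception:
        # Covers a missing CuPy install as well as CUDARuntimeError
        # (e.g. cudaErrorInsufficientDriver on a machine without a GPU).
        return False


def device_context(gpu_id: int = 0):
    """Hypothetical helper: CuPy device context on GPU machines, no-op otherwise."""
    if cuda_available():
        import cupy as cp
        return cp.cuda.Device(gpu_id)
    return nullcontext()


with device_context(0):
    print("CUDA available:", cuda_available())
```

On a CPU-only system the `with` block is entered without touching the CUDA runtime, so the `CUDARuntimeError` above is not raised at that point.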

LouisK92 avatar Aug 06 '25 14:08 LouisK92

There is indeed a way to do this. I recommend cloning and reinstalling the repo and then using segger.prediction.predict_parquet.segment. I also recommend this set of parameters:

receptive_field = {'k_bd': 4, 'dist_bd': 7.5, 'k_tx': 15, 'dist_tx': 3} # <-- change dist_bd to 7.5 for smaller/more sensible cell radius.

segment(
    model,
    dm,
    score_cut=0.75,  # <-- add this
    save_dir=TMP_DIR,
    seg_tag='segger_output',
    transcript_file=TRANSCRIPTS_PARQUET,
    receptive_field=receptive_field,
    min_transcripts=5,
    cell_id_col='segger_cell_id',
    use_cc=False,
    knn_method='kd_tree',
    verbose=True,
)

EliHei2 avatar Aug 06 '25 14:08 EliHei2

Thanks!

I tried cloning and reinstalling. However, since I use the danielunyi42/segger_dev:cuda121 Docker container, I ran into this error:

ERROR: Package 'segger' requires a different Python: 3.10.12 not in '>=3.11'

The default Python version on Ubuntu 22.04 is 3.10, so the Dockerfile would need to be adjusted. I tried to build one from scratch:

# Base image with CUDA and cuDNN
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04

# Install essential tools and Python 3.11 from deadsnakes PPA
RUN apt-get update -y && \
    apt-get install -y --no-install-recommends \
        git \
        wget \
        curl \
        unzip \
        htop \
        vim \
        build-essential \
        software-properties-common && \
    add-apt-repository -y ppa:deadsnakes/ppa && \
    apt-get update -y && \
    apt-get install -y --no-install-recommends \
        python3.11 \
        python3.11-venv \
        python3.11-dev \
        python3-pip && \
    rm -f /usr/bin/python3 && \
    ln -s /usr/bin/python3.11 /usr/bin/python3 && \
    ln -s /usr/bin/python3.11 /usr/bin/python && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

With this, the newest segger version can be installed.
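A quick way to confirm that the interpreter inside a rebuilt image satisfies the '>=3.11' constraint from the pip error is to compare version tuples before installing. This is only an illustrative check; `MIN_PYTHON` and `meets_minimum` are names made up for this sketch.

```python
import sys

MIN_PYTHON = (3, 11)  # taken from the '>=3.11' in the pip error above


def meets_minimum(version, minimum=MIN_PYTHON) -> bool:
    """Compare the (major, minor) part of a version tuple against a minimum."""
    return tuple(version[:2]) >= tuple(minimum)


# The interpreter from the original container (3.10.12) fails the check.
print(meets_minimum((3, 10, 12)))  # False

# The interpreter actually running this script.
print(meets_minimum(sys.version_info))
```

Running this with `python3` inside the container shows whether the symlinks in the Dockerfile point at the intended interpreter.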

My issue now is that I also need spatialdata in the environment for the data handling and I run into the dependency issue described here: https://github.com/EliHei2/segger_dev/issues/123

LouisK92 avatar Aug 07 '25 11:08 LouisK92

For testing, I tried to run the code with the latest segger installed from git (leaving out my spatialdata data prep). I still ran into the same CUDARuntimeError.

LouisK92 avatar Aug 07 '25 19:08 LouisK92

@LouisK92 for now, the prediction step is only available on GPUs (it requires CUDA). This is not a fundamental requirement: in theory the model does not need GPUs, but we assumed that, given the training step, people would have or need GPUs anyway. I will make a PR at some point to remove this requirement, but for now we do require GPUs. Are GPUs not available in your benchmarking config? If so, I'll prioritise this task.

EliHei2 avatar Aug 08 '25 07:08 EliHei2

@daniel-unyi-42 could you look into the docker issues?

EliHei2 avatar Aug 08 '25 07:08 EliHei2

For running the large-scale benchmark, GPUs are available on my side. I am just limited on the development and testing side: the openproblems GitHub Actions tests only run on CPU, and the local tests require Docker containers, which I can't use on our cluster. That's why I develop locally on a Mac without an Nvidia GPU. I can work around this by disabling the tests and testing on the cluster with a converted Singularity container, but it would be super helpful if I could run the tests on CPU.
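Until prediction works on CPU, one alternative to disabling tests outright is to skip only the GPU-dependent ones when no CUDA device is detected. The sketch below uses the standard library's `unittest.skipUnless` together with `torch.cuda.is_available()`. The test class and its placeholder bodies are hypothetical and do not reflect the openproblems test setup.

```python
import unittest


def gpu_available() -> bool:
    """Return True if PyTorch is importable and sees a CUDA device."""
    try:
        import torch
        return torch.cuda.is_available()
    except Exception:
        # PyTorch missing or failing to load counts as "no GPU".
        return False


class SegmentationTests(unittest.TestCase):
    def test_cpu_only_step(self):
        # Placeholder for checks that need no GPU (e.g. data preparation).
        self.assertTrue(True)

    @unittest.skipUnless(gpu_available(), "prediction currently requires a CUDA GPU")
    def test_gpu_prediction(self):
        # Placeholder: a real test would call segment(...) here.
        self.assertTrue(gpu_available())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SegmentationTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

On a CPU-only runner the GPU test is reported as skipped rather than failed, so the rest of the suite still runs.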

LouisK92 avatar Aug 08 '25 08:08 LouisK92