CUDA error when running in a Docker container
I want to run whisper-server in Docker on a Jetson Orin NX.
Outside of Docker, everything works fine.
Inside Docker, the model is successfully loaded on the GPU, but the app crashes when running inference:
CUDA error: the resource allocation failed
current device: 0, in function cublas_handle at /app/ggml/src/ggml-cuda/../ggml-cuda/common.cuh:689
cublasCreate_v2(&cublas_handles[device])
/app/ggml/src/ggml-cuda/ggml-cuda.cu:72: CUDA error
Memory usage is around 1.6 GB / 15.3 GB, and I'm using the tiny model, so we can rule out any out-of-memory errors.
I've modified the original Dockerfile based on the newer llama.cpp Dockerfile, mainly so that it starts the server directly. I don't think that would introduce any issues, though.
I'm currently lost as to what the problem could be. I've tried different CUDA versions (12.3.1, 12.6.0, 12.6.3; the host is on JetPack 6.1 with 12.6) and about everything else I could think of.
I'm running another container with llama.cpp on the same device, and there everything works just fine.
The whole log for completeness:
whisper_init_from_file_with_params_no_state: loading model from '/models/ggml-tiny.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: yes
whisper_init_with_params_no_state: devices = 2
whisper_init_with_params_no_state: backends = 2
whisper_model_load: loading model
whisper_model_load: n_vocab = 51865
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 384
whisper_model_load: n_audio_head = 6
whisper_model_load: n_audio_layer = 4
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 384
whisper_model_load: n_text_head = 6
whisper_model_load: n_text_layer = 4
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 1 (tiny)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: CUDA0 total size = 77.11 MB
whisper_model_load: model size = 77.11 MB
whisper_backend_init_gpu: using CUDA0 backend
whisper_init_state: kv self size = 3.15 MB
whisper_init_state: kv cross size = 9.44 MB
whisper_init_state: kv pad size = 2.36 MB
whisper_init_state: compute buffer (conv) = 14.15 MB
whisper_init_state: compute buffer (encode) = 17.70 MB
whisper_init_state: compute buffer (cross) = 3.88 MB
whisper_init_state: compute buffer (decode) = 96.81 MB
system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | COREML = 0 | OPENVINO = 0 |
operator(): processing 'sample.wav' (47488 samples, 3.0 sec), 4 threads, 1 processors, lang = auto, task = transcribe, timestamps = 0 ...
whisper server listening at http://0.0.0.0:8080
Received request: sample.wav
Successfully loaded sample.wav
Running whisper.cpp inference on sample.wav
CUDA error: the resource allocation failed
current device: 0, in function cublas_handle at /app/ggml/src/ggml-cuda/../ggml-cuda/common.cuh:689
cublasCreate_v2(&cublas_handles[device])
/app/ggml/src/ggml-cuda/ggml-cuda.cu:72: CUDA error
For completeness, I've tried with v1.7.4 and the latest head, same result.
It would be great if somebody could give me any pointers.
I tried again with the latest head, same error. I would still highly appreciate any pointers/ideas/help!
@Tharit Could you show us the Dockerfile that is being used?
It sounds like this only happens on a Jetson Orin, which I'm afraid I don't have access to, but I noticed that flash attention is enabled. Could you try disabling it, just to see if that makes a difference?
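If it helps, I believe flash attention is opt-in on the whisper-server command line in the versions you mentioned (via -fa / --flash-attn), so dropping that flag from however you start the server should be enough, roughly:

/app/whisper-server -m /models/ggml-tiny.bin --host 0.0.0.0 --port 8080    # no -fa / --flash-attn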
@danbev Thanks for your support!
I tried without flash attention as well, same error. It's also not related to the audio or the server: whisper-bench crashes with the same error. I built the image with different CUDA versions (tried 12.3.1, 12.6, 12.6.1; the host is on 12.6.1) and with different CUDA architectures (the default, or 87 to match the device's compute capability), with no difference.
Outside the container it works; inside, it does not.
The Dockerfile is:
ARG UBUNTU_VERSION=22.04
# This needs to generally match the container host's environment.
ARG CUDA_VERSION=12.6.1
# Target the CUDA build image
ARG BASE_CUDA_DEV_CONTAINER=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}
ARG BASE_CUDA_RUN_CONTAINER=nvidia/cuda:${CUDA_VERSION}-runtime-ubuntu${UBUNTU_VERSION}
FROM ${BASE_CUDA_DEV_CONTAINER} AS build
# CUDA architecture to build for (defaults to all supported archs)
ARG CUDA_DOCKER_ARCH=default
RUN apt-get update && \
    apt-get install -y build-essential cmake libgomp1 git
WORKDIR /app
COPY . .
RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
        export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
    fi && \
    cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=1 ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
    cmake --build build --config Release -j$(nproc)
RUN mkdir -p /app/lib && \
    find build -name "*.so*" -exec cp {} /app/lib \;
RUN mkdir -p /app/full \
    && cp build/bin/* /app/full
## Base image
FROM ${BASE_CUDA_RUN_CONTAINER} AS base
RUN apt-get update \
    && apt-get install -y libgomp1 curl \
    && apt autoremove -y \
    && apt clean -y \
    && rm -rf /tmp/* /var/tmp/* \
    && find /var/cache/apt/archives /var/lib/apt/lists -not -name lock -type f -delete \
    && find /var/cache -type f -delete
COPY --from=build /app/lib/ /app
### Server, Server only
FROM base AS server
COPY --from=build /app/full /app
WORKDIR /app
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/" ]
ENTRYPOINT [ "/app/whisper-server" ]
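For reference, I build and start it roughly like this (the image name and the model mount are just placeholders from my setup; on the Jetson the container has to use the NVIDIA runtime):

$ docker build --build-arg CUDA_DOCKER_ARCH=87 -t whisper-server .
$ docker run --runtime nvidia -p 8080:8080 -v /path/to/models:/models \
    whisper-server -m /models/ggml-tiny.bin --host 0.0.0.0 --port 8080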
Thanks for the Dockerfile. I've given it a try and I'm able to run it without reproducing the error (which I wasn't expecting to, as I don't have the hardware).
Could you run the following and see if it produces an error similar to the one from ggml?
cat > test_cublas.cu << 'EOF'
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main() {
    cublasHandle_t handle;
    cublasStatus_t status = cublasCreate(&handle);
    printf("cublasCreate status: %d\n", status);
    if (status == CUBLAS_STATUS_SUCCESS) {
        printf("CUBLAS initialized successfully\n");
        cublasDestroy(handle);
    }
    return 0;
}
EOF

$ nvcc test_cublas.cu -lcublas -o test_cublas
$ ./test_cublas
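If cuBLAS is usable inside the container, that should print:

cublasCreate status: 0
CUBLAS initialized successfully

A non-zero status is a cublasStatus_t value; 3 is CUBLAS_STATUS_ALLOC_FAILED, which is the "resource allocation failed" condition that ggml reported.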
@danbev Thanks, this helped me to finally find the solution!!
I tried your code snippet, and it did indeed work outside of the container but not inside: same error as with whisper.cpp (allocation failed, code 3). So there is clearly some issue with the container.
With this new insight, I quickly realized that on Jetson devices the "nvidia/cuda" base images apparently do not fully work: CUDA itself works, and that's "enough" to run llama.cpp, but cuBLAS does not, and apparently neither does cuDNN.
Solution: on Jetson, use the nvcr.io/nvidia/l4t-jetpack base images (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/l4t-jetpack), choosing the tag that matches the flashed L4T version, e.g. JetPack 6.1 corresponds to 36.4.0.
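Since the Dockerfile above already takes the base images as build args, the swap doesn't even require editing it; roughly what I ended up building (assuming JetPack 6.1 / L4T 36.4.0 and building on the device itself):

$ docker build \
    --build-arg BASE_CUDA_DEV_CONTAINER=nvcr.io/nvidia/l4t-jetpack:r36.4.0 \
    --build-arg BASE_CUDA_RUN_CONTAINER=nvcr.io/nvidia/l4t-jetpack:r36.4.0 \
    --build-arg CUDA_DOCKER_ARCH=87 \
    -t whisper-server .

With that base, cublasCreate succeeds inside the container and whisper-server runs fine.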
It might be worth putting that in the docs and/or the Dockerfile somewhere to save others from falling into this trap.