
Cannot run multi-GPU inference: DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument")

Open mfurseman opened this issue 4 months ago • 8 comments

I have a server with three RTX 3090 24GB GPUs. When I try to run inference across more than one GPU, it fails with CUDA_ERROR_INVALID_VALUE.

Runtime output

user@hostname ~/mistral.rs (master) [SIGINT]> CUDA_VISIBLE_DEVICES=0,1,2  mistralrs-server --port 8083 -n 0:1 -n 1:1   run  -m microsoft/Phi-3.5-MoE-instruct
2025-08-03T10:21:21.022133Z  INFO mistralrs_server_core::mistralrs_for_server_builder: avx: true, neon: false, simd128: false, f16c: true
2025-08-03T10:21:21.022151Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-08-03T10:21:21.022164Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-08-03T10:21:21.027663Z  INFO hf_hub: Using token file found "/home/mfurse/.cache/huggingface/token"    
2025-08-03T10:21:21.028177Z  INFO hf_hub: Using token file found "/home/mfurse/.cache/huggingface/token"    
2025-08-03T10:21:21.028217Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `microsoft/Phi-3.5-MoE-instruct`
2025-08-03T10:21:21.028245Z  INFO mistralrs_core::pipeline::normal: Loading `config.json` at `microsoft/Phi-3.5-MoE-instruct`
2025-08-03T10:21:21.029864Z  INFO mistralrs_core::pipeline::paths: Read from cache file "/home/mfurse/.cache/huggingface/hub/microsoft-Phi-3.5-MoE-instruct_repo_list.json"
2025-08-03T10:21:21.030091Z  INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00017.safetensors", "model-00002-of-00017.safetensors", "model-00003-of-00017.safetensors", "model-00004-of-00017.safetensors", "model-00005-of-00017.safetensors", "model-00006-of-00017.safetensors", "model-00007-of-00017.safetensors", "model-00008-of-00017.safetensors", "model-00009-of-00017.safetensors", "model-00010-of-00017.safetensors", "model-00011-of-00017.safetensors", "model-00012-of-00017.safetensors", "model-00013-of-00017.safetensors", "model-00014-of-00017.safetensors", "model-00015-of-00017.safetensors", "model-00016-of-00017.safetensors", "model-00017-of-00017.safetensors"]
2025-08-03T10:21:21.030348Z  INFO mistralrs_core::pipeline::normal: Read from cache file "/home/mfurse/.cache/huggingface/hub/microsoft-Phi-3.5-MoE-instruct_repo_list.json"
2025-08-03T10:21:21.030356Z  INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `microsoft/Phi-3.5-MoE-instruct`
2025-08-03T10:21:21.030371Z  INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `microsoft/Phi-3.5-MoE-instruct`
2025-08-03T10:21:21.030394Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-08-03T10:21:21.291282Z  INFO mistralrs_quant::utils::log: Model has 32 repeating layers.
2025-08-03T10:21:21.291490Z  INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-08-03T10:21:21.291524Z  INFO mistralrs_quant::utils::log: Layers 0-0: cuda[0] (24 GB)
2025-08-03T10:21:21.291540Z  INFO mistralrs_quant::utils::log: Layers 1-1: cuda[1] (24 GB)
2025-08-03T10:21:21.413031Z  INFO mistralrs_quant::utils::log: Layers 2-31: cpu (437 GB)
2025-08-03T10:21:21.505684Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.6
2025-08-03T10:21:21.836126Z  INFO mistralrs_core::utils::normal: DType selected is F16.
2025-08-03T10:21:21.836156Z  WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
2025-08-03T10:21:21.836195Z  INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 32064, hidden_act: Silu, hidden_size: 4096, intermediate_size: 6400, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 10000.0, rope_scaling: Some(Classic { short_factor: [1.0, 1.0399999618530271, 1.0399999618530271, 1.0399999618530271, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.059999942779541, 1.059999942779541, 1.069999933242798, 1.069999933242798, 1.069999933242798, 1.069999933242798, 1.1399999856948853, 1.159999966621399, 1.159999966621399, 1.159999966621399, 1.159999966621399, 1.1799999475479126, 1.1999999284744265, 1.319999933242798, 1.3399999141693115, 1.3499999046325684, 1.3999998569488523, 1.4799998998641968, 1.4999998807907104, 1.589999794960022, 1.6499998569488523, 1.71999990940094, 1.8999998569488523, 1.9099998474121096, 1.9099998474121096, 1.9899998903274536, 1.9999998807907104, 1.9999998807907104, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.0999999046325684, 2.319999933242798, 2.419999837875366, 2.5899999141693115, 2.7899999618530273], long_factor: [1.0199999809265137, 1.0299999713897705, 1.0399999618530271, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.059999942779541, 1.059999942779541, 1.059999942779541, 1.059999942779541, 1.059999942779541, 1.059999942779541, 1.0999999046325684, 1.1799999475479126, 1.1799999475479126, 1.3700000047683716, 1.4899998903274536, 2.109999895095825, 2.8899998664855957, 3.9499998092651367, 4.299999713897705, 6.429999828338623, 8.09000015258789, 10.690000534057615, 12.050000190734863, 18.229999542236328, 18.84000015258789, 19.899999618530273, 21.420000076293945, 26.200000762939453, 34.28000259399414, 34.590003967285156, 38.730003356933594, 40.22000503540039, 42.54000473022461, 44.000003814697266, 47.59000396728515, 54.750003814697266, 56.19000244140625, 57.44000244140625, 57.4900016784668, 61.20000076293945, 61.540000915527344, 61.75, 61.779998779296875, 62.06999969482422, 63.11000061035156, 63.43000030517578, 63.560001373291016, 63.71000289916992, 63.92000198364258, 63.94000244140625, 63.94000244140625, 63.96000289916992, 63.980003356933594, 64.0300064086914, 64.0300064086914, 64.0300064086914, 64.04000854492188, 64.10000610351563, 64.19000244140625, 64.20999908447266, 64.75, 64.95999908447266], scaling_type: Su }), max_position_embeddings: 131072, sliding_window: Some(131072), original_max_position_embeddings: 4096, quantization_config: None, lm_head_bias: true, attention_bias: true, num_local_experts: 16, router_jitter_noise: 0.01, tie_word_embeddings: false }
[17 weight-shard loading progress bars elided; each shard loads to 100%]
2025-08-03T10:21:41.942442Z  INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|endoftext|>", "<|assistant|>", "<|end|>", unk_tok = <unk>
2025-08-03T10:21:41.952245Z  INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-08-03T10:21:41.952306Z  INFO mistralrs_core: Pipeline input modalities are [📝 Text]
2025-08-03T10:21:41.952311Z  INFO mistralrs_core: Pipeline output modalities are [📝 Text]
2025-08-03T10:21:41.952378Z  INFO mistralrs_core: Beginning dummy run.
2025-08-03T10:21:41.966479Z  INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.

thread '<unnamed>' panicked at mistralrs-core/src/kv_cache/mod.rs:513:26:
Could not prepare cache: DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2025-08-03T10:21:41.967256Z  WARN mistralrs_core: Dummy run failed!
2025-08-03T10:21:41.968128Z  INFO mistralrs_server: OpenAI-compatible server listening on http://0.0.0.0:8083.

Latest commit or version

I am tracking the master branch; this build is from commit a2fc13217.

I had to add /usr/local/cuda/bin to my PATH, as setting NVCC_CCBIN alone did not let me build:

NVCC_CCBIN=/usr/local/cuda/bin/nvcc cargo install --path mistralrs-server  --features "cuda flash-attn"
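
For reference, the build that eventually worked amounted to putting the CUDA toolchain on PATH before invoking cargo; a minimal sketch (POSIX shell syntax assumed):

export PATH=/usr/local/cuda/bin:$PATH   # so the build scripts can find nvcc
cargo install --path mistralrs-server --features "cuda flash-attn"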

The compiler and the driver versions match:

user@hostname ~> nvidia-smi --version                                                                           
NVIDIA-SMI version  : 575.57.08
NVML version        : 575.57
DRIVER version      : 575.57.08
CUDA Version        : 12.9
user@hostname ~> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0

When running with NCCL I don't see this issue, but I do with the default allocation, which I need in order to split larger models across GPU and CPU. See the sketch below for the two invocations I'm comparing.
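
Roughly (assuming a build with the nccl feature picks NCCL tensor parallelism automatically when no -n mappings are given):

# Works: NCCL tensor parallelism across the visible GPUs
CUDA_VISIBLE_DEVICES=0,1,2 mistralrs-server --port 8083 run -m microsoft/Phi-3.5-MoE-instruct

# Fails: explicit device mapping, which is what a GPU+CPU split needs
CUDA_VISIBLE_DEVICES=0,1,2 mistralrs-server --port 8083 -n 0:1 -n 1:1 run -m microsoft/Phi-3.5-MoE-instruct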

mfurseman avatar Aug 03 '25 12:08 mfurseman

@mfurseman - any chance you might be able to reproduce with RUST_BACKTRACE=full? Secondly, does this happen when you build with NCCL but run without it using MISTRALRS_NO_NCCL=1?
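
For concreteness, combining the two (reusing the command from the report) would be something like:

RUST_BACKTRACE=full MISTRALRS_NO_NCCL=1 CUDA_VISIBLE_DEVICES=0,1,2 mistralrs-server --port 8083 -n 0:1 -n 1:1 run -m microsoft/Phi-3.5-MoE-instruct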

sempervictus avatar Aug 05 '25 15:08 sempervictus

Sure, here's the full backtrace:

2025-08-05T21:47:59.632696Z ERROR mistralrs_core::engine: prompt step - Model failed with error: DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument")
   0: candle_core::error::Error::bt
   1: <candle_core::cuda_backend::CudaStorage as candle_core::backend::BackendStorage>::to_cpu_storage
   2: candle_core::tensor::Tensor::to_device
   3: mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::make_prompt_chunk
   4: mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::get_prompt_input
   5: <mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::TextInputsProcessor as mistralrs_core::pipeline::inputs_processor::InputsProcessor>::process_inputs
   6: mistralrs_core::pipeline::Pipeline::step::{{closure}}
   7: mistralrs_core::engine::Engine::run::{{closure}}
   8: std::sys::backtrace::__rust_begin_short_backtrace
   9: core::ops::function::FnOnce::call_once{{vtable.shim}}
  10: std::sys::pal::unix::thread::Thread::new::thread_start
  11: start_thread
             at ./nptl/pthread_create.c:448:8
  12: __GI___clone3
             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78:0

   0: candle_core::error::Error::bt
   1: candle_core::error::Error::msg
   2: mistralrs_core::pipeline::Pipeline::step::{{closure}}
   3: mistralrs_core::engine::Engine::run::{{closure}}
   4: std::sys::backtrace::__rust_begin_short_backtrace
   5: core::ops::function::FnOnce::call_once{{vtable.shim}}
   6: std::sys::pal::unix::thread::Thread::new::thread_start
   7: start_thread
             at ./nptl/pthread_create.c:448:8
   8: __GI___clone3
             at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78:0

2025-08-05T21:47:59.632799Z  INFO mistralrs_core: Dummy run completed in 0.09776427s.
Error: Address already in use (os error 98)

Stack backtrace:
   0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
   1: mistralrs_server::main::{{closure}}.70681
   2: mistralrs_server::main
   3: std::sys::backtrace::__rust_begin_short_backtrace
   4: std::rt::lang_start::{{closure}}
   5: std::rt::lang_start_internal
   6: main
   7: __libc_start_call_main
             at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
   8: __libc_start_main_impl
             at ./csu/../csu/libc-start.c:360:3
   9: _start

This does still happen with the nccl feature in the build and MISTRALRS_NO_NCCL=1 set.

mfurseman avatar Aug 05 '25 22:08 mfurseman

Thank you. I'm curious about the mechanics of <candle_core::cuda_backend::CudaStorage as candle_core::backend::BackendStorage>::to_cpu_storage with respect to host versus device memory. I'm starting to get my bearings in the layout of the libraries involved and will try to figure out how this works; sharing host memory is pretty much a requirement for anyone without dedicated servers and a sufficiently complex task to hand off to the machine.
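
As a starting point, the failing call chain in the backtrace boils down to a device-to-host tensor transfer. A minimal sketch against candle-core, as I read the API (illustrative only, not mistral.rs's actual code):

use candle_core::{DType, Device, Tensor};

fn main() -> candle_core::Result<()> {
    let gpu = Device::new_cuda(0)?;                   // CUDA device 0
    let t = Tensor::zeros((4, 4), DType::F16, &gpu)?; // lives in device memory
    // `to_device` from CUDA to CPU routes through `to_cpu_storage`,
    // i.e. a device-to-host copy -- the step that fails in the backtrace.
    let on_host = t.to_device(&Device::Cpu)?;
    println!("{on_host}");
    Ok(())
}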

sempervictus avatar Aug 05 '25 22:08 sempervictus

I can confirm encountering the same error when trying to use an RTX 5070 Ti 16GB together with an RTX 3060 12GB, and I can test any fix on that setup when it's available.

Incidentally, since as I understand it this also happens with the Ring backend, which is supposed to support heterogeneous cards/nodes, I have a tangential question: are there any plans for more allocation options? For example, I was thinking about allocating more of the KV cache to the weaker card (though I don't know if that makes sense).

jaen avatar Nov 01 '25 16:11 jaen

I have the same question.

jed-hacker avatar Nov 13 '25 10:11 jed-hacker

My server has eight RTX 3090s. I built with cargo build --release --features cuda and tried to run Qwen3-Coder-30B with:

RUST_BACKTRACE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./mistralrs-server --prefix-cache-n 2 --num-device-layers "0:6;1:6;2:6;3:6;4:6;5:6;6:6;7:6" run --model-id Qwen3-Coder-30B-A3B-Instruct/ --dtype auto --max-seq-len 1024 --max-batch-size 1

This fails with the same error as above.

jed-hacker avatar Nov 13 '25 10:11 jed-hacker

GPT tells me these errors are due to the arguments, but changing the arguments did not help.

jed-hacker avatar Nov 13 '25 10:11 jed-hacker

I also encountered this issue and solved it with PR #1723, which adds a flag to explicitly set the number of GPUs used for NCCL. By default the model is loaded on only one CUDA GPU, and exactly the same error is reported if NCCL is disabled with the env flag.

guoqingbao avatar Nov 19 '25 13:11 guoqingbao