Cannot run multi-GPU inference: DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument")
I have a server with three RTX 3090 24 GB GPUs. When I try to run inference with layers mapped across more than one GPU, it fails with CUDA_ERROR_INVALID_VALUE.
Runtime output
user@hostname ~/mistral.rs (master) [SIGINT]> CUDA_VISIBLE_DEVICES=0,1,2 mistralrs-server --port 8083 -n 0:1 -n 1:1 run -m microsoft/Phi-3.5-MoE-instruct
2025-08-03T10:21:21.022133Z INFO mistralrs_server_core::mistralrs_for_server_builder: avx: true, neon: false, simd128: false, f16c: true
2025-08-03T10:21:21.022151Z INFO mistralrs_server_core::mistralrs_for_server_builder: Sampling method: penalties -> temperature -> topk -> topp -> minp -> multinomial
2025-08-03T10:21:21.022164Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model kind is: normal (no adapters)
2025-08-03T10:21:21.027663Z INFO hf_hub: Using token file found "/home/mfurse/.cache/huggingface/token"
2025-08-03T10:21:21.028177Z INFO hf_hub: Using token file found "/home/mfurse/.cache/huggingface/token"
2025-08-03T10:21:21.028217Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer.json` at `microsoft/Phi-3.5-MoE-instruct`
2025-08-03T10:21:21.028245Z INFO mistralrs_core::pipeline::normal: Loading `config.json` at `microsoft/Phi-3.5-MoE-instruct`
2025-08-03T10:21:21.029864Z INFO mistralrs_core::pipeline::paths: Read from cache file "/home/mfurse/.cache/huggingface/hub/microsoft-Phi-3.5-MoE-instruct_repo_list.json"
2025-08-03T10:21:21.030091Z INFO mistralrs_core::pipeline::paths: Found model weight filenames ["model-00001-of-00017.safetensors", "model-00002-of-00017.safetensors", "model-00003-of-00017.safetensors", "model-00004-of-00017.safetensors", "model-00005-of-00017.safetensors", "model-00006-of-00017.safetensors", "model-00007-of-00017.safetensors", "model-00008-of-00017.safetensors", "model-00009-of-00017.safetensors", "model-00010-of-00017.safetensors", "model-00011-of-00017.safetensors", "model-00012-of-00017.safetensors", "model-00013-of-00017.safetensors", "model-00014-of-00017.safetensors", "model-00015-of-00017.safetensors", "model-00016-of-00017.safetensors", "model-00017-of-00017.safetensors"]
2025-08-03T10:21:21.030348Z INFO mistralrs_core::pipeline::normal: Read from cache file "/home/mfurse/.cache/huggingface/hub/microsoft-Phi-3.5-MoE-instruct_repo_list.json"
2025-08-03T10:21:21.030356Z INFO mistralrs_core::pipeline::normal: Loading `generation_config.json` at `microsoft/Phi-3.5-MoE-instruct`
2025-08-03T10:21:21.030371Z INFO mistralrs_core::pipeline::normal: Loading `tokenizer_config.json` at `microsoft/Phi-3.5-MoE-instruct`
2025-08-03T10:21:21.030394Z INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2025-08-03T10:21:21.291282Z INFO mistralrs_quant::utils::log: Model has 32 repeating layers.
2025-08-03T10:21:21.291490Z INFO mistralrs_quant::utils::log: Loading model according to the following repeating layer mappings:
2025-08-03T10:21:21.291524Z INFO mistralrs_quant::utils::log: Layers 0-0: cuda[0] (24 GB)
2025-08-03T10:21:21.291540Z INFO mistralrs_quant::utils::log: Layers 1-1: cuda[1] (24 GB)
2025-08-03T10:21:21.413031Z INFO mistralrs_quant::utils::log: Layers 2-31: cpu (437 GB)
2025-08-03T10:21:21.505684Z INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.6
2025-08-03T10:21:21.836126Z INFO mistralrs_core::utils::normal: DType selected is F16.
2025-08-03T10:21:21.836156Z WARN mistralrs_core::pipeline::normal: Device mapping contains a mix of GPU and CPU. There is no CPU support for PagedAttention, disabling PagedAttention.
2025-08-03T10:21:21.836195Z INFO mistralrs_core::pipeline::normal: Model config: Config { vocab_size: 32064, hidden_act: Silu, hidden_size: 4096, intermediate_size: 6400, num_hidden_layers: 32, num_attention_heads: 32, num_key_value_heads: 8, rms_norm_eps: 1e-5, rope_theta: 10000.0, rope_scaling: Some(Classic { short_factor: [1.0, 1.0399999618530271, 1.0399999618530271, 1.0399999618530271, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.059999942779541, 1.059999942779541, 1.069999933242798, 1.069999933242798, 1.069999933242798, 1.069999933242798, 1.1399999856948853, 1.159999966621399, 1.159999966621399, 1.159999966621399, 1.159999966621399, 1.1799999475479126, 1.1999999284744265, 1.319999933242798, 1.3399999141693115, 1.3499999046325684, 1.3999998569488523, 1.4799998998641968, 1.4999998807907104, 1.589999794960022, 1.6499998569488523, 1.71999990940094, 1.8999998569488523, 1.9099998474121096, 1.9099998474121096, 1.9899998903274536, 1.9999998807907104, 1.9999998807907104, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.009999990463257, 2.0999999046325684, 2.319999933242798, 2.419999837875366, 2.5899999141693115, 2.7899999618530273], long_factor: [1.0199999809265137, 1.0299999713897705, 1.0399999618530271, 1.0499999523162842, 1.0499999523162842, 1.0499999523162842, 1.059999942779541, 1.059999942779541, 1.059999942779541, 1.059999942779541, 1.059999942779541, 1.059999942779541, 1.0999999046325684, 1.1799999475479126, 1.1799999475479126, 1.3700000047683716, 1.4899998903274536, 2.109999895095825, 2.8899998664855957, 3.9499998092651367, 4.299999713897705, 6.429999828338623, 8.09000015258789, 10.690000534057615, 12.050000190734863, 18.229999542236328, 18.84000015258789, 19.899999618530273, 21.420000076293945, 26.200000762939453, 34.28000259399414, 34.590003967285156, 38.730003356933594, 40.22000503540039, 42.54000473022461, 44.000003814697266, 47.59000396728515, 54.750003814697266, 56.19000244140625, 57.44000244140625, 57.4900016784668, 61.20000076293945, 61.540000915527344, 61.75, 61.779998779296875, 62.06999969482422, 63.11000061035156, 63.43000030517578, 63.560001373291016, 63.71000289916992, 63.92000198364258, 63.94000244140625, 63.94000244140625, 63.96000289916992, 63.980003356933594, 64.0300064086914, 64.0300064086914, 64.0300064086914, 64.04000854492188, 64.10000610351563, 64.19000244140625, 64.20999908447266, 64.75, 64.95999908447266], scaling_type: Su }), max_position_embeddings: 131072, sliding_window: Some(131072), original_max_position_embeddings: 4096, quantization_config: None, lm_head_bias: true, attention_bias: true, num_local_experts: 16, router_jitter_noise: 0.01, tie_word_embeddings: false }
[progress bars for loading the 17 safetensors shards omitted; all reached 100%]
2025-08-03T10:21:41.942442Z INFO mistralrs_core::pipeline::chat_template: bos_toks = "<s>", eos_toks = "<|endoftext|>", "<|assistant|>", "<|end|>", unk_tok = <unk>
2025-08-03T10:21:41.952245Z INFO mistralrs_server_core::mistralrs_for_server_builder: Model loaded.
2025-08-03T10:21:41.952306Z INFO mistralrs_core: Pipeline input modalities are [📝 Text]
2025-08-03T10:21:41.952311Z INFO mistralrs_core: Pipeline output modalities are [📝 Text]
2025-08-03T10:21:41.952378Z INFO mistralrs_core: Beginning dummy run.
2025-08-03T10:21:41.966479Z INFO mistralrs_core::prefix_cacher: PrefixCacherV2 is enabled. Expect higher multi-turn throughput for both text and multimodal.
thread '<unnamed>' panicked at mistralrs-core/src/kv_cache/mod.rs:513:26:
Could not prepare cache: DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2025-08-03T10:21:41.967256Z WARN mistralrs_core: Dummy run failed!
2025-08-03T10:21:41.968128Z INFO mistralrs_server: OpenAI-compatible server listening on http://0.0.0.0:8083.
Latest commit or version
I am tracking the master branch; this build is against a2fc13217.
I had to add /usr/local/cuda/bin to my PATH, as setting NVCC_CCBIN did not allow me to build.
NVCC_CCBIN=/usr/local/cuda/bin/nvcc cargo install --path mistralrs-server --features "cuda flash-attn"
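For reference, the PATH-based workaround that let the build complete looks roughly like this (a sketch; assumes CUDA is installed under /usr/local/cuda):
export PATH=/usr/local/cuda/bin:$PATH
cargo install --path mistralrs-server --features "cuda flash-attn"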
The compiler and the driver versions match:
user@hostname ~> nvidia-smi --version
NVIDIA-SMI version : 575.57.08
NVML version : 575.57
DRIVER version : 575.57.08
CUDA Version : 12.9
user@hostname ~> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0
When running with NCCL I don't see this issue, but I do with the default allocation, which I want to use so that I can split larger models onto the CPU.
@mfurseman - any chance you might be able to reproduce with RUST_BACKTRACE=full? Secondly, does this happen when you build with NCCL but run without it using MISTRALRS_NO_NCCL=1?
Sure, here's the full backtrace:
2025-08-05T21:47:59.632696Z ERROR mistralrs_core::engine: prompt step - Model failed with error: DriverError(CUDA_ERROR_INVALID_VALUE, "invalid argument")
0: candle_core::error::Error::bt
1: <candle_core::cuda_backend::CudaStorage as candle_core::backend::BackendStorage>::to_cpu_storage
2: candle_core::tensor::Tensor::to_device
3: mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::make_prompt_chunk
4: mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::get_prompt_input
5: <mistralrs_core::pipeline::inputs_processor::text_models_inputs_processor::TextInputsProcessor as mistralrs_core::pipeline::inputs_processor::InputsProcessor>::process_inputs
6: mistralrs_core::pipeline::Pipeline::step::{{closure}}
7: mistralrs_core::engine::Engine::run::{{closure}}
8: std::sys::backtrace::__rust_begin_short_backtrace
9: core::ops::function::FnOnce::call_once{{vtable.shim}}
10: std::sys::pal::unix::thread::Thread::new::thread_start
11: start_thread
at ./nptl/pthread_create.c:448:8
12: __GI___clone3
at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78:0
0: candle_core::error::Error::bt
1: candle_core::error::Error::msg
2: mistralrs_core::pipeline::Pipeline::step::{{closure}}
3: mistralrs_core::engine::Engine::run::{{closure}}
4: std::sys::backtrace::__rust_begin_short_backtrace
5: core::ops::function::FnOnce::call_once{{vtable.shim}}
6: std::sys::pal::unix::thread::Thread::new::thread_start
7: start_thread
at ./nptl/pthread_create.c:448:8
8: __GI___clone3
at ./misc/../sysdeps/unix/sysv/linux/x86_64/clone3.S:78:0
2025-08-05T21:47:59.632799Z INFO mistralrs_core: Dummy run completed in 0.09776427s.
Error: Address already in use (os error 98)
Stack backtrace:
0: anyhow::error::<impl core::convert::From<E> for anyhow::Error>::from
1: mistralrs_server::main::{{closure}}.70681
2: mistralrs_server::main
3: std::sys::backtrace::__rust_begin_short_backtrace
4: std::rt::lang_start::{{closure}}
5: std::rt::lang_start_internal
6: main
7: __libc_start_call_main
at ./csu/../sysdeps/nptl/libc_start_call_main.h:58:16
8: __libc_start_main_impl
at ./csu/../csu/libc-start.c:360:3
9: _start
This does still happen with the nccl feature in the build and MISTRALRS_NO_NCCL=1 set.
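For completeness, the failing repro with the nccl feature built in but NCCL disabled is the same invocation as the original run, just with the env var set:
MISTRALRS_NO_NCCL=1 CUDA_VISIBLE_DEVICES=0,1,2 mistralrs-server --port 8083 -n 0:1 -n 1:1 run -m microsoft/Phi-3.5-MoE-instruct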
Thank you. I'm curious about the mechanics of <candle_core::cuda_backend::CudaStorage as candle_core::backend::BackendStorage>::to_cpu_storage with respect to host versus device memory. I'm starting to get my bearings in the layout of the libraries involved and will try to figure out how that works, since spilling to host memory is pretty much a requirement for anyone without their own servers and a sufficiently complex task to hand off to the machine.
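To make the path in the backtrace concrete, here is a minimal candle sketch (my own illustration, not mistral.rs code; assumes candle_core built with the cuda feature) of the to_device / to_cpu_storage sequence that fails above:

use candle_core::{DType, Device, Result, Tensor};

fn main() -> Result<()> {
    // Allocate a small tensor on the first visible GPU.
    let cuda = Device::new_cuda(0)?;
    let t = Tensor::zeros((2, 3), DType::F16, &cuda)?;
    // Tensor::to_device with a CPU target dispatches to
    // CudaStorage::to_cpu_storage, i.e. a device-to-host copy. Per the
    // backtrace, this is where CUDA_ERROR_INVALID_VALUE surfaces when the
    // prompt chunk is moved between devices.
    let on_cpu = t.to_device(&Device::Cpu)?;
    println!("{:?}", on_cpu.shape());
    Ok(())
}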
I can confirm encountering the same issue when trying to use an RTX 5070 Ti 16GB together with an RTX 3060 12GB. I can test any fix on that setup when available.
Incidentally, since as I understand it this also happens with the Ring backend, which is supposed to support heterogeneous cards/nodes, I have a tangential question: are there any plans for more allocation options? For example, I was thinking about allocating more of the KV cache to the weaker card (though I don't know whether that makes sense).
I have the same question.
My server has eight RTX 3090s, and I also built with cargo build --release --features cuda. With the model Qwen3-Coder-30B, running
RUST_BACKTRACE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./mistralrs-server --prefix-cache-n 2 --num-device-layers "0:6;1:6;2:6;3:6;4:6;5:6;6:6;7:6" run --model-id Qwen3-Coder-30B-A3B-Instruct/ --dtype auto --max-seq-len 1024 --max-batch-size 1
produces the same errors as yours.
GPT tells me these errors are due to the arguments, but changing those arguments did not help.
I also encountered this issue and solved it in PR #1723 by adding a flag to explicitly denote the number of GPUs used for NCCL. By default it loads the model on only one CUDA GPU, and it reports exactly the same error if we disable NCCL with the env flag.