Eval bug: MUSA backend produces nonsense output on unsloth/deepseek-r1 quantized model
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 MUSA devices:
Device 0: MTT S4000, compute capability 2.2, VMM: yes
Device 1: MTT S4000, compute capability 2.2, VMM: yes
Device 2: MTT S4000, compute capability 2.2, VMM: yes
Device 3: MTT S4000, compute capability 2.2, VMM: yes
Device 4: MTT S4000, compute capability 2.2, VMM: yes
Device 5: MTT S4000, compute capability 2.2, VMM: yes
Device 6: MTT S4000, compute capability 2.2, VMM: yes
Device 7: MTT S4000, compute capability 2.2, VMM: yes
version: 5058 (6bf28f01)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
Musa
Hardware
2x Hygon C86 7385 32C, 8x MTT S4000, Ubuntu 24.04 with 5.15.0-136-generic kernel
Models
https://huggingface.co/unsloth/DeepSeek-R1-GGUF (DeepSeek-R1-UD-IQ2_XXS)
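For anyone reproducing this, a download along these lines should work (the --include pattern assumes the quant's shards sit in a DeepSeek-R1-UD-IQ2_XXS folder inside the repo, and the mirror matches the one in the script below):

export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
--include "DeepSeek-R1-UD-IQ2_XXS/*" \
--local-dir /opt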
Problem description & steps to reproduce
Running the command below produces nonsense output (an endless stream of "D" characters) instead of a reply:
export HF_ENDPOINT=https://hf-mirror.com
/root/llama.cpp/build/bin/llama-cli \
--model /opt/DeepSeek-R1-UD-IQ2_XXS/DeepSeek-R1-UD-IQ2_XXS-00001-of-00004.gguf \
--cache-type-k q4_0 \
--threads 32 -no-cnv \
--n-gpu-layers 70 \
--flash-attn \
--temp 0.6 \
--ctx-size 16384 \
--prompt "<|User|>你好啊.<|Assistant|>"
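As an isolation step (just a sketch, untested on this machine), the same prompt can be re-run with the quantized K cache and flash attention dropped, so the default f16 KV cache and the non-FA path are exercised; if the output is still garbage, those two options can be ruled out:

/root/llama.cpp/build/bin/llama-cli \
--model /opt/DeepSeek-R1-UD-IQ2_XXS/DeepSeek-R1-UD-IQ2_XXS-00001-of-00004.gguf \
--threads 32 -no-cnv \
--n-gpu-layers 70 \
--temp 0.6 \
--ctx-size 16384 \
--prompt "<|User|>你好啊.<|Assistant|>"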
First Bad Commit
No response
Relevant log output
(base) root@s4000-8gpu:~# bash start.sh
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 MUSA devices:
Device 0: MTT S4000, compute capability 2.2, VMM: yes
Device 1: MTT S4000, compute capability 2.2, VMM: yes
Device 2: MTT S4000, compute capability 2.2, VMM: yes
Device 3: MTT S4000, compute capability 2.2, VMM: yes
Device 4: MTT S4000, compute capability 2.2, VMM: yes
Device 5: MTT S4000, compute capability 2.2, VMM: yes
Device 6: MTT S4000, compute capability 2.2, VMM: yes
Device 7: MTT S4000, compute capability 2.2, VMM: yes
build: 5058 (6bf28f01) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device MUSA0 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA1 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA2 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA3 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA4 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA5 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA6 (MTT S4000) - 49055 MiB free
llama_model_load_from_file_impl: using device MUSA7 (MTT S4000) - 49055 MiB free
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 52 key-value pairs and 1025 tensors from /opt/DeepSeek-R1-UD-IQ2_XXS/DeepSeek-R1-UD-IQ2_XXS-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = DeepSeek R1 BF16
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 256x20B
llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 6: deepseek2.block_count u32 = 61
llama_model_loader: - kv 7: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 8: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 9: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 10: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 11: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 12: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 13: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 15: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 16: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 17: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 18: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 19: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 20: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 21: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 22: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 23: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 24: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 25: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 26: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 27: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 28: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 29: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 30: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 31: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
llama_model_loader: - kv 37: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 128815
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 41: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 42: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 43: general.quantization_version u32 = 2
llama_model_loader: - kv 44: general.file_type u32 = 19
llama_model_loader: - kv 45: quantize.imatrix.file str = DeepSeek-R1.imatrix
llama_model_loader: - kv 46: quantize.imatrix.dataset str = /training_data/calibration_datav3.txt
llama_model_loader: - kv 47: quantize.imatrix.entries_count i32 = 720
llama_model_loader: - kv 48: quantize.imatrix.chunks_count i32 = 124
llama_model_loader: - kv 49: split.no u16 = 0
llama_model_loader: - kv 50: split.tensors.count i32 = 1025
llama_model_loader: - kv 51: split.count u16 = 4
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q2_K: 55 tensors
llama_model_loader: - type q3_K: 3 tensors
llama_model_loader: - type q4_K: 190 tensors
llama_model_loader: - type q5_K: 116 tensors
llama_model_loader: - type q6_K: 184 tensors
llama_model_loader: - type iq2_xxs: 116 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ2_XXS - 2.0625 bpw
print_info: file size = 182.69 GiB (2.34 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 819
load: token to piece cache size = 0.8223 MB
print_info: arch = deepseek2
print_info: vocab_only = 0
print_info: n_ctx_train = 163840
print_info: n_embd = 7168
print_info: n_layer = 61
print_info: n_head = 128
print_info: n_head_kv = 128
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 24576
print_info: n_embd_v_gqa = 16384
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = DeepSeek R1 BF16
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 128815 '<|PAD▁TOKEN|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: CPU_Mapped model buffer size = 10257.32 MiB
load_tensors: MUSA0 model buffer size = 13360.59 MiB
load_tensors: MUSA1 model buffer size = 25338.47 MiB
load_tensors: MUSA2 model buffer size = 25338.47 MiB
load_tensors: MUSA3 model buffer size = 25338.47 MiB
load_tensors: MUSA4 model buffer size = 22171.16 MiB
load_tensors: MUSA5 model buffer size = 25338.47 MiB
load_tensors: MUSA6 model buffer size = 25338.47 MiB
load_tensors: MUSA7 model buffer size = 19728.83 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (16384) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context: MUSA_Host output buffer size = 0.49 MiB
init: kv_size = 16384, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 61, can_shift = 0
init: MUSA0 KV buffer size = 5824.00 MiB
init: MUSA1 KV buffer size = 5824.00 MiB
init: MUSA2 KV buffer size = 5824.00 MiB
init: MUSA3 KV buffer size = 5824.00 MiB
init: MUSA4 KV buffer size = 5096.00 MiB
init: MUSA5 KV buffer size = 5824.00 MiB
init: MUSA6 KV buffer size = 5824.00 MiB
init: MUSA7 KV buffer size = 4368.00 MiB
llama_context: KV self size = 44408.00 MiB, K (q4_0): 13176.00 MiB, V (f16): 31232.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: MUSA0 compute buffer size = 2802.01 MiB
llama_context: MUSA1 compute buffer size = 1330.01 MiB
llama_context: MUSA2 compute buffer size = 1330.01 MiB
llama_context: MUSA3 compute buffer size = 1330.01 MiB
llama_context: MUSA4 compute buffer size = 1202.01 MiB
llama_context: MUSA5 compute buffer size = 1330.01 MiB
llama_context: MUSA6 compute buffer size = 1330.01 MiB
llama_context: MUSA7 compute buffer size = 1118.52 MiB
llama_context: MUSA_Host compute buffer size = 190038.01 MiB
llama_context: graph nodes = 4843
llama_context: graph splits = 137
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 16384
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 32
system_info: n_threads = 32 (n_threads_batch = 32) / 64 | MUSA : PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
sampler seed: 1126874445
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 16384
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.600
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 16384, n_batch = 2048, n_predict = -1, n_keep = 1
你好啊.DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
The GPUs only became busy sporadically while the output stayed stuck on "DDDDD":
(base) root@s4000-8gpu:~# mthreads-gmi -pm -l 2
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 6%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 46%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 83%
1 3230 66% 87%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 52%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 65%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 85%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 90%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 65%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 75%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 74%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 29%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
gpu pid memUtil gpuUtil
0 3230 44% 0%
1 3230 66% 0%
2 3230 66% 0%
3 3230 66% 0%
4 3230 58% 0%
5 3230 66% 0%
6 3230 66% 0%
7 3230 51% 0%
Could you please provide your kernel version, MTGPU driver version, and MUSA toolkit version? This will help us investigate the issue further.
Kernel: Ubuntu 22.04 LTS with linux-5.15.0-136-generic.
Drivers and utilities are the ones shipped in musasdk_rc3.1.0.zip from the Moore Threads official site.
We highly recommend switching to MUSA SDK rc3.1.1, as all of our latest testing has been conducted against this version.
I've just sent you an email — feel free to continue the conversation on WeChat.
rc3.1.1 installed and rebuilding...
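For reference, the rebuild follows the standard MUSA build path documented for llama.cpp; the extra flags here are ordinary defaults, nothing specific to this report:

cd /root/llama.cpp
rm -rf build
cmake -B build -DGGML_MUSA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)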
Confirmed: the problem is related to multi-GPU use together with MUSA SDK rc3.1.1, which has no official support for the S4000, especially multi-S4000 setups. Waiting for a driver update.
This issue was closed because it has been inactive for 14 days since being marked as stale.