Eval bug: Crashes when a model is loaded across a Vega VII card and Mi50s
Name and Version
build: 6963 (6db3d1ffe) with cc (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
HIP
Hardware
Vega VII + 2 x Mi50s
Models
No response
Problem description & steps to reproduce
When I run any model, of any size, split across the Vega VII and either or both of the Mi50s, this error occurs. Inference on the Vega VII alone works fine, as does inference on either or both of the Mi50s alone, but I cannot run them mixed. After tracing it down with AI assistance, the issue appears to be related to the fact that although all of these cards are gfx906, the Vega VII does not have ECC VRAM while the Mi50s do. It seems it is not possible to compile for both the ECC and non-ECC variants of gfx906 at once. ROCm has target qualifiers for both (gfx906:sramecc- and gfx906:sramecc+), but perhaps those are not exposed in the llama.cpp build commands?
I do not believe this matters much to the issue, but the Vega VII is an MPX module in a 2019 Mac Pro. I am using Pop OS 22 with patches from T2 Linux, and ROCm 7.0.1 with the Tensile fix (copying the Tensile files from a rocBLAS build for gfx906). I have tested this on ROCm 7.1, 6.4, 6.3, and 6.2 and the crash has never changed.
In this particular case, I compiled with:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON -DGPU_TARGETS=gfx906 -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16
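If the sramecc mismatch were really the cause, one thing I would want to try is building code objects for both variants explicitly. I have not verified that the GPU_TARGETS list accepts target-feature suffixes (so treat this as a guess rather than a known-good command), but it would presumably look something like:

HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
    cmake -S . -B build -DGGML_HIP=ON \
    -DGPU_TARGETS="gfx906:sramecc-:xnack-;gfx906:sramecc+:xnack-" \
    -DCMAKE_BUILD_TYPE=Release \
    && cmake --build build --config Release -- -j 16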
First Bad Commit
No response
Relevant log output
~/Desktop/LLAMA_NEW/llama.cpp/build/bin$ ./llama-server -m /home/name/Downloads/MiniMax-M2-UD-IQ3_XXS-00001-of-00002.gguf -ngl 30 -c 128000
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 6963 (6db3d1ffe) with cc (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0 for x86_64-linux-gnu
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24
system_info: n_threads = 12 (n_threads_batch = 12) / 24 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv load_model: loading model '/home/name/Downloads/MiniMax-M2-UD-IQ3_XXS-00001-of-00002.gguf'
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) (0000:09:00.0) - 32728 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon Graphics) (0000:10:00.0) - 32728 MiB free
llama_model_load_from_file_impl: using device ROCm2 (AMD Radeon Graphics) (0000:16:00.0) - 32728 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 44 key-value pairs and 809 tensors from /home/name/Downloads/MiniMax-M2-UD-IQ3_XXS-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = minimax-m2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Minimax-M2
llama_model_loader: - kv 3: general.basename str = Minimax-M2
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 256x4.9B
llama_model_loader: - kv 6: general.license str = mit
llama_model_loader: - kv 7: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 8: general.tags arr[str,1] = ["text-generation"]
llama_model_loader: - kv 9: minimax-m2.block_count u32 = 62
llama_model_loader: - kv 10: minimax-m2.context_length u32 = 196608
llama_model_loader: - kv 11: minimax-m2.embedding_length u32 = 3072
llama_model_loader: - kv 12: minimax-m2.feed_forward_length u32 = 1536
llama_model_loader: - kv 13: minimax-m2.attention.head_count u32 = 48
llama_model_loader: - kv 14: minimax-m2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: minimax-m2.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 16: minimax-m2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 17: minimax-m2.expert_count u32 = 256
llama_model_loader: - kv 18: minimax-m2.expert_used_count u32 = 8
llama_model_loader: - kv 19: minimax-m2.attention.key_length u32 = 128
llama_model_loader: - kv 20: minimax-m2.attention.value_length u32 = 128
llama_model_loader: - kv 21: minimax-m2.expert_gating_func u32 = 2
llama_model_loader: - kv 22: minimax-m2.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 23: minimax-m2.rope.dimension_count u32 = 64
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = minimax-m2
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,200064] = ["Ā", "ā", "Ă", "ă", "Ą", "ą", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,200064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,199744] = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "e r...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 200034
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 200020
llama_model_loader: - kv 31: tokenizer.ggml.unknown_token_id u32 = 200021
llama_model_loader: - kv 32: tokenizer.ggml.padding_token_id u32 = 200004
llama_model_loader: - kv 33: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 34: tokenizer.chat_template str = {# Unsloth & community template fixes...
llama_model_loader: - kv 35: general.quantization_version u32 = 2
llama_model_loader: - kv 36: general.file_type u32 = 23
llama_model_loader: - kv 37: quantize.imatrix.file str = MiniMax-M2-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv 38: quantize.imatrix.dataset str = unsloth_calibration_MiniMax-M2.txt
llama_model_loader: - kv 39: quantize.imatrix.entries_count u32 = 496
llama_model_loader: - kv 40: quantize.imatrix.chunks_count u32 = 697
llama_model_loader: - kv 41: split.no u16 = 0
llama_model_loader: - kv 42: split.tensors.count i32 = 809
llama_model_loader: - kv 43: split.count u16 = 2
llama_model_loader: - type f32: 373 tensors
llama_model_loader: - type q4_K: 1 tensors
llama_model_loader: - type q5_K: 20 tensors
llama_model_loader: - type q6_K: 11 tensors
llama_model_loader: - type iq3_xxs: 128 tensors
llama_model_loader: - type iq3_s: 44 tensors
llama_model_loader: - type iq4_xs: 232 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ3_XXS - 3.0625 bpw
print_info: file size = 87.17 GiB (3.27 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 200004 ('<fim_pad>')
load: - 200005 ('<reponame>')
load: - 200020 ('[e~[')
load: special tokens cache size = 54
load: token to piece cache size = 1.3355 MB
print_info: arch = minimax-m2
print_info: vocab_only = 0
print_info: n_ctx_train = 196608
print_info: n_embd = 3072
print_info: n_layer = 62
print_info: n_head = 48
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 6
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 1536
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 196608
print_info: rope_finetuned = unknown
print_info: model type = 230B.A10B
print_info: model params = 228.69 B
print_info: general.name = Minimax-M2
print_info: vocab type = BPE
print_info: n_vocab = 200064
print_info: n_merges = 199744
print_info: BOS token = 200034 ']~!b['
print_info: EOS token = 200020 '[e~['
print_info: UNK token = 200021 ']!d~['
print_info: PAD token = 200004 '<fim_pad>'
print_info: LF token = 10 'Ċ'
print_info: FIM PRE token = 200001 '<fim_prefix>'
print_info: FIM SUF token = 200003 '<fim_suffix>'
print_info: FIM MID token = 200002 '<fim_middle>'
print_info: FIM PAD token = 200004 '<fim_pad>'
print_info: FIM REP token = 200005 '<reponame>'
print_info: EOG token = 200004 '<fim_pad>'
print_info: EOG token = 200005 '<reponame>'
print_info: EOG token = 200020 '[e~['
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloaded 30/63 layers to GPU
load_tensors: CPU_Mapped model buffer size = 45174.95 MiB
load_tensors: ROCm0 model buffer size = 14651.21 MiB
load_tensors: ROCm1 model buffer size = 14307.85 MiB
load_tensors: ROCm2 model buffer size = 15129.94 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 128000
llama_context: n_ctx_seq = 128000
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = true
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (128000) < n_ctx_train (196608) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 3.05 MiB
llama_kv_cache: CPU KV buffer size = 16000.00 MiB
llama_kv_cache: ROCm0 KV buffer size = 5000.00 MiB
llama_kv_cache: ROCm1 KV buffer size = 5000.00 MiB
llama_kv_cache: ROCm2 KV buffer size = 5000.00 MiB
llama_kv_cache: size = 31000.00 MiB (128000 cells, 62 layers, 4/1 seqs), K (f16): 15500.00 MiB, V (f16): 15500.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: ROCm0 compute buffer size = 1532.56 MiB
llama_context: ROCm1 compute buffer size = 210.51 MiB
llama_context: ROCm2 compute buffer size = 210.51 MiB
llama_context: ROCm_Host compute buffer size = 256.01 MiB
llama_context: graph nodes = 3975
llama_context: graph splits = 486 (with bs=512), 5 (with bs=1)
common_init_from_params: added <fim_pad> logit bias = -inf
common_init_from_params: added <reponame> logit bias = -inf
common_init_from_params: added [e~[ logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 128000
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 503
ggml_cuda_compute_forward: ADD failed
ROCm error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at /home/name/Desktop/LLAMA_NEW/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2722
/home/name/Desktop/LLAMA_NEW/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:90: ROCm error
err
[New LWP 1370285]
[New LWP 1370288]
[New LWP 1370289]
[New LWP 1370290]
[New LWP 1370291]
[New LWP 1370292]
[New LWP 1370293]
[New LWP 1370294]
[New LWP 1370295]
[New LWP 1370296]
[New LWP 1370297]
[New LWP 1370298]
[New LWP 1370299]
[New LWP 1370300]
[New LWP 1370301]
[New LWP 1370302]
[New LWP 1370303]
[New LWP 1370304]
[New LWP 1370305]
[New LWP 1370306]
[New LWP 1370307]
[New LWP 1370308]
[New LWP 1370309]
[New LWP 1370310]
[New LWP 1370311]
[New LWP 1370312]
[New LWP 1370314]
[New LWP 1370326]
[New LWP 1370327]
[New LWP 1370328]
[New LWP 1370329]
[New LWP 1370330]
[New LWP 1370331]
[New LWP 1370332]
[New LWP 1370333]
[New LWP 1370334]
[New LWP 1370335]
[New LWP 1370336]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007313506ea42f in __GI___wait4 (pid=1370353, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x00007313506ea42f in __GI___wait4 (pid=1370353, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000731350d7058b in ggml_print_backtrace () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-base.so
#2 0x0000731350d70723 in ggml_abort () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-base.so
#3 0x000073134f85def2 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-hip.so
#4 0x000073134f865a54 in evaluate_and_capture_cuda_graph(ggml_backend_cuda_context*, ggml_cgraph*, bool&, bool&, bool&) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-hip.so
#5 0x000073134f8630bf in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-hip.so
#6 0x0000731350d8be57 in ggml_backend_sched_graph_compute_async () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libggml-base.so
#7 0x0000731350ea0811 in llama_context::graph_compute(ggml_cgraph*, bool) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libllama.so
#8 0x0000731350ea20cc in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libllama.so
#9 0x0000731350ea7cb9 in llama_context::decode(llama_batch const&) () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libllama.so
#10 0x0000731350ea8c2f in llama_decode () from /home/name/Desktop/LLAMA_NEW/llama.cpp/build/bin/libllama.so
#11 0x0000561f239cc7a8 in common_init_from_params(common_params&) ()
#12 0x0000561f2389f349 in server_context::load_model(common_params const&) ()
#13 0x0000561f238327e8 in main ()
[Inferior 1 (process 1370284) detached]
Aborted (core dumped)
Had the same issue
Code compiled for the plain gfx906 LLVM target (with no qualifiers) runs in both sramecc-enabled and sramecc-disabled modes (as well as both xnack states), so no, it's not that.
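One way to sanity-check this claim (a minimal sketch, not taken from this thread; the file name ecc_check.hip and the test itself are my own): build a no-op kernel for plain gfx906 and launch it on every device. If unqualified gfx906 code objects really load under both sramecc modes, all three cards should report success, while an "invalid device function" here would point back at the code objects.

cat > ecc_check.hip <<'EOF'
#include <hip/hip_runtime.h>
#include <cstdio>

// trivial kernel: we only care whether a plain gfx906 code object
// can be loaded and launched on each device
__global__ void noop() {}

int main() {
    int n = 0;
    hipGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        hipDeviceProp_t prop;
        hipSetDevice(i);
        hipGetDeviceProperties(&prop, i);
        noop<<<1, 1>>>();
        hipError_t err = hipGetLastError();
        if (err == hipSuccess) err = hipDeviceSynchronize();
        // prints e.g. the device ISA (gfx906:sramecc-:xnack-) and the launch result
        printf("device %d (%s): %s\n", i, prop.gcnArchName, hipGetErrorString(err));
    }
    return 0;
}
EOF
hipcc --offload-arch=gfx906 ecc_check.hip -o ecc_check && ./ecc_check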
Could you try with ROCr compiled with https://github.com/ROCm/rocm-systems/commit/12430fe25a91faff276c65ace2019d42927351a2 applied?
@cfreeamd perhaps you could comment on this being a problem you are aware of.
I was able to rebuild ROCm with that patch applied, but the error remains the same. I tried toggling -fa on and off, which did not help, and -sm row didn't help either. I can still run inference on the Vega VII and on both Mi50s individually, just not together. I don't know if this is related, but I am also not able to do this in Vulkan either. Rather than crashing, Vulkan produces gibberish at about 2 tokens per second.
These tests were against the newest version of llama.cpp, with ROCm freshly compiled. The Vulkan test used the current release build, b7054. In Vulkan's case, it produces gibberish on the Vega VII both by itself and when used with the other cards. It worked fine until a breaking build of llama.cpp a couple of months ago, but I don't know which one it was.
ROCm Patch:
name@pop-os:~/Desktop/LLAMA_NEW/llama.cpp/build/bin$ ./llama-server -m /home/name/Downloads/gpt-oss-120b-MXFP4-00001-of-00002.gguf -ngl 99 -c 50000 -fa on
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 ROCm devices:
Device 0: AMD Radeon Graphics, gfx906:sramecc-:xnack- (0x906), VMM: no, Wave Size: 64
Device 1: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
Device 2: AMD Radeon Graphics, gfx906:sramecc+:xnack- (0x906), VMM: no, Wave Size: 64
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 7055 (f1bad23f8) with cc (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0 for x86_64-linux-gnu
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24
system_info: n_threads = 12 (n_threads_batch = 12) / 24 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv load_model: loading model '/home/name/Downloads/gpt-oss-120b-MXFP4-00001-of-00002.gguf'
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) (0000:09:00.0) - 32728 MiB free
llama_model_load_from_file_impl: using device ROCm1 (AMD Radeon Graphics) (0000:10:00.0) - 32728 MiB free
llama_model_load_from_file_impl: using device ROCm2 (AMD Radeon Graphics) (0000:16:00.0) - 32728 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 36 key-value pairs and 687 tensors from /home/name/Downloads/gpt-oss-120b-MXFP4-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Openai_Gpt Oss 120b
llama_model_loader: - kv 3: general.basename str = openai_gpt-oss
llama_model_loader: - kv 4: general.size_label str = 120B
llama_model_loader: - kv 5: gpt-oss.block_count u32 = 36
llama_model_loader: - kv 6: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 7: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 8: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 9: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 10: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 11: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 12: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 13: gpt-oss.expert_count u32 = 128
llama_model_loader: - kv 14: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 15: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 16: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 17: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 18: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 19: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 20: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 21: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,201088] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 200002
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 199999
llama_model_loader: - kv 30: tokenizer.chat_template str = {#-\n In addition to the normal input...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - kv 33: split.no u16 = 0
llama_model_loader: - kv 34: split.tensors.count i32 = 687
llama_model_loader: - kv 35: split.count u16 = 2
llama_model_loader: - type f32: 433 tensors
llama_model_loader: - type q8_0: 146 tensors
llama_model_loader: - type mxfp4: 108 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 59.02 GiB (4.34 BPW)
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_embd_inp = 2880
print_info: n_layer = 36
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 128
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 120B
print_info: model params = 116.83 B
print_info: general.name = Openai_Gpt Oss 120b
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 200002 '<|return|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: PAD token = 199999 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 586.82 MiB
load_tensors: ROCm0 model buffer size = 21401.19 MiB
load_tensors: ROCm1 model buffer size = 19754.95 MiB
load_tensors: ROCm2 model buffer size = 18695.54 MiB
..........................................................................................srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 503
..........
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 50176
llama_context: n_ctx_seq = 50176
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: n_ctx_seq (50176) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 3.07 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 50176 cells
llama_kv_cache: ROCm0 KV buffer size = 588.00 MiB
llama_kv_cache: ROCm1 KV buffer size = 588.00 MiB
llama_kv_cache: ROCm2 KV buffer size = 588.00 MiB
llama_kv_cache: size = 1764.00 MiB ( 50176 cells, 18 layers, 4/1 seqs), K (f16): 882.00 MiB, V (f16): 882.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells
llama_kv_cache: ROCm0 KV buffer size = 14.00 MiB
llama_kv_cache: ROCm1 KV buffer size = 12.00 MiB
llama_kv_cache: ROCm2 KV buffer size = 10.00 MiB
llama_kv_cache: size = 36.00 MiB ( 1024 cells, 18 layers, 4/1 seqs), K (f16): 18.00 MiB, V (f16): 18.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
llama_context: ROCm0 compute buffer size = 567.32 MiB
llama_context: ROCm1 compute buffer size = 309.32 MiB
llama_context: ROCm2 compute buffer size = 620.95 MiB
llama_context: ROCm_Host compute buffer size = 405.71 MiB
llama_context: graph nodes = 2024
llama_context: graph splits = 4
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 50176
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
/home/name/Desktop/LLAMA_NEW/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:89: ROCm error
ggml_cuda_compute_forward: MUL_MAT failed
ROCm error: invalid device function
current device: 0, in function ggml_cuda_compute_forward at /home/name/Desktop/LLAMA_NEW/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2727
err
[New LWP 151086]
[New LWP 151089]
[New LWP 151090]
[New LWP 151091]
[New LWP 151092]
[New LWP 151093]
[New LWP 151094]
[New LWP 151095]
[New LWP 151096]
[New LWP 151097]
[New LWP 151098]
[New LWP 151099]
[New LWP 151100]
[New LWP 151101]
[New LWP 151102]
[New LWP 151103]
[New LWP 151104]
[New LWP 151105]
[New LWP 151106]
[New LWP 151107]
[New LWP 151108]
[New LWP 151109]
[New LWP 151110]
[New LWP 151111]
[New LWP 151112]
[New LWP 151113]
[New LWP 151114]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x0000763d764ea42f in __GI___wait4 (pid=151118, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0 0x0000763d764ea42f in __GI___wait4 (pid=151118, stat_loc=0x0, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30 in ../sysdeps/unix/sysv/linux/wait4.c
#1 0x0000763d76f365cb in ggml_print_backtrace () from libggml-base.so.0
#2 0x0000763d76f36763 in ggml_abort () from libggml-base.so.0
#3 0x0000763d75865f12 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () from libggml-hip.so.0
#4 0x0000763d7586d9cf in evaluate_and_capture_cuda_graph(ggml_backend_cuda_context*, ggml_cgraph*, bool&, bool&, bool&) () from libggml-hip.so.0
#5 0x0000763d7586b0df in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) () from libggml-hip.so.0
#6 0x0000763d76f52327 in ggml_backend_sched_graph_compute_async () from libggml-base.so.0
#7 0x0000763d76ca0811 in llama_context::graph_compute(ggml_cgraph*, bool) () from libllama.so.0
#8 0x0000763d76ca20cc in llama_context::process_ubatch(llama_ubatch const&, llm_graph_type, llama_memory_context_i*, ggml_status&) () from libllama.so.0
#9 0x0000763d76ca7cf1 in llama_context::decode(llama_batch const&) () from libllama.so.0
#10 0x0000763d76ca8c5f in llama_decode () from libllama.so.0
#11 0x0000631d495431c8 in common_init_from_params(common_params&) ()
#12 0x0000631d49434b4a in server_context::load_model(common_params const&) ()
#13 0x0000631d493e1afe in main ()
[Inferior 1 (process 151085) detached]
Aborted (core dumped)
VULKAN:
name@pop-os:~/Desktop/LLAMA_NEW/VULKAN_TEST/build/bin$ ./llama-server -m /home/name/Downloads/gpt-oss-120b-MXFP4-00001-of-00002.gguf -ngl 99 -c 50000 -fa on
load_backend: loaded RPC backend from /home/name/Desktop/LLAMA_NEW/VULKAN_TEST/build/bin/libggml-rpc.so
ggml_vulkan: Found 3 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 1 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
ggml_vulkan: 2 = AMD Radeon Graphics (RADV VEGA20) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/name/Desktop/LLAMA_NEW/VULKAN_TEST/build/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/name/Desktop/LLAMA_NEW/VULKAN_TEST/build/bin/libggml-cpu-skylakex.so
main: setting n_parallel = 4 and kv_unified = true (add -kvu to disable this)
build: 7054 (becc4816d) with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu
system info: n_threads = 12, n_threads_batch = 12, total_threads = 24
system_info: n_threads = 12 (n_threads_batch = 12) / 24 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: binding port with default address family
main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 23
main: loading model
srv load_model: loading model '/home/name/Downloads/gpt-oss-120b-MXFP4-00001-of-00002.gguf'
llama_model_load_from_file_impl: using device Vulkan0 (AMD Radeon Graphics (RADV VEGA20)) (0000:09:00.0) - 31998 MiB free
llama_model_load_from_file_impl: using device Vulkan1 (AMD Radeon Graphics (RADV VEGA20)) (0000:10:00.0) - 32751 MiB free
llama_model_load_from_file_impl: using device Vulkan2 (AMD Radeon Graphics (RADV VEGA20)) (0000:16:00.0) - 32741 MiB free
llama_model_loader: additional 1 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 36 key-value pairs and 687 tensors from /home/name/Downloads/gpt-oss-120b-MXFP4-00001-of-00002.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt-oss
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Openai_Gpt Oss 120b
llama_model_loader: - kv 3: general.basename str = openai_gpt-oss
llama_model_loader: - kv 4: general.size_label str = 120B
llama_model_loader: - kv 5: gpt-oss.block_count u32 = 36
llama_model_loader: - kv 6: gpt-oss.context_length u32 = 131072
llama_model_loader: - kv 7: gpt-oss.embedding_length u32 = 2880
llama_model_loader: - kv 8: gpt-oss.feed_forward_length u32 = 2880
llama_model_loader: - kv 9: gpt-oss.attention.head_count u32 = 64
llama_model_loader: - kv 10: gpt-oss.attention.head_count_kv u32 = 8
llama_model_loader: - kv 11: gpt-oss.rope.freq_base f32 = 150000.000000
llama_model_loader: - kv 12: gpt-oss.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 13: gpt-oss.expert_count u32 = 128
llama_model_loader: - kv 14: gpt-oss.expert_used_count u32 = 4
llama_model_loader: - kv 15: gpt-oss.attention.key_length u32 = 64
llama_model_loader: - kv 16: gpt-oss.attention.value_length u32 = 64
llama_model_loader: - kv 17: gpt-oss.attention.sliding_window u32 = 128
llama_model_loader: - kv 18: gpt-oss.expert_feed_forward_length u32 = 2880
llama_model_loader: - kv 19: gpt-oss.rope.scaling.type str = yarn
llama_model_loader: - kv 20: gpt-oss.rope.scaling.factor f32 = 32.000000
llama_model_loader: - kv 21: gpt-oss.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = gpt-4o
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,201088] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,201088] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,446189] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 199998
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 200002
llama_model_loader: - kv 29: tokenizer.ggml.padding_token_id u32 = 199999
llama_model_loader: - kv 30: tokenizer.chat_template str = {#-\n In addition to the normal input...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 38
llama_model_loader: - kv 33: split.no u16 = 0
llama_model_loader: - kv 34: split.tensors.count i32 = 687
llama_model_loader: - kv 35: split.count u16 = 2
llama_model_loader: - type f32: 433 tensors
llama_model_loader: - type q8_0: 146 tensors
llama_model_loader: - type mxfp4: 108 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = MXFP4 MoE
print_info: file size = 59.02 GiB (4.34 BPW)
load: printing all EOG tokens:
load: - 199999 ('<|endoftext|>')
load: - 200002 ('<|return|>')
load: - 200007 ('<|end|>')
load: - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch = gpt-oss
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 2880
print_info: n_embd_inp = 2880
print_info: n_layer = 36
print_info: n_head = 64
print_info: n_head_kv = 8
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 2880
print_info: n_expert = 128
print_info: n_expert_used = 4
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = yarn
print_info: freq_base_train = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: model type = 120B
print_info: model params = 116.83 B
print_info: general.name = Openai_Gpt Oss 120b
print_info: n_ff_exp = 2880
print_info: vocab type = BPE
print_info: n_vocab = 201088
print_info: n_merges = 446189
print_info: BOS token = 199998 '<|startoftext|>'
print_info: EOS token = 200002 '<|return|>'
print_info: EOT token = 199999 '<|endoftext|>'
print_info: PAD token = 199999 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 199999 '<|endoftext|>'
print_info: EOG token = 200002 '<|return|>'
print_info: EOG token = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 36 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors: CPU_Mapped model buffer size = 586.82 MiB
load_tensors: Vulkan0 model buffer size = 21401.18 MiB
load_tensors: Vulkan1 model buffer size = 19754.94 MiB
load_tensors: Vulkan2 model buffer size = 18695.53 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 50176
llama_context: n_ctx_seq = 50176
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 150000.0
llama_context: freq_scale = 0.03125
llama_context: n_ctx_seq (50176) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 3.07 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 50176 cells
llama_kv_cache: Vulkan0 KV buffer size = 588.00 MiB
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 503
llama_kv_cache: Vulkan1 KV buffer size = 588.00 MiB
llama_kv_cache: Vulkan2 KV buffer size = 588.00 MiB
llama_kv_cache: size = 1764.00 MiB ( 50176 cells, 18 layers, 4/1 seqs), K (f16): 882.00 MiB, V (f16): 882.00 MiB
llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells
llama_kv_cache: Vulkan0 KV buffer size = 14.00 MiB
llama_kv_cache: Vulkan1 KV buffer size = 12.00 MiB
llama_kv_cache: Vulkan2 KV buffer size = 10.00 MiB
llama_kv_cache: size = 36.00 MiB ( 1024 cells, 18 layers, 4/1 seqs), K (f16): 18.00 MiB, V (f16): 18.00 MiB
llama_context: Vulkan0 compute buffer size = 187.77 MiB
llama_context: Vulkan1 compute buffer size = 136.77 MiB
llama_context: Vulkan2 compute buffer size = 398.38 MiB
llama_context: Vulkan_Host compute buffer size = 105.65 MiB
llama_context: graph nodes = 2024
llama_context: graph splits = 4
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 50176
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv log_server_r: request: Center** 400
srv init: initializing slots, n_slots = 4
slot init: id 0 | task -1 | new slot, n_ctx = 50176
slot init: id 1 | task -1 | new slot, n_ctx = 50176
slot init: id 2 | task -1 | new slot, n_ctx = 50176
slot init: id 3 | task -1 | new slot, n_ctx = 50176
srv init: prompt cache is enabled, size limit: 8192 MiB
srv init: use --cache-ram 0 to disable the prompt cache
srv init: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv init: thinking = 0
main: model loaded
main: chat template, chat_template: {#-
In addition to the normal inputs of messages and tools, this template also accepts the
following kwargs:
- "builtin_tools": A list, can contain "browser" and/or "python".
- "model_identity": A string that optionally describes the model identity.
- "reasoning_effort": A string that describes the reasoning effort, defaults to "medium". #}
{#- Tool Definition Rendering ============================================== #} {%- macro render_typescript_type(param_spec, required_params, is_nullable=false) -%} {%- if param_spec.type == "array" -%} {%- if param_spec['items'] -%} {%- if param_spec['items']['type'] == "string" -%} {{- "string[]" }} {%- elif param_spec['items']['type'] == "number" -%} {{- "number[]" }} {%- elif param_spec['items']['type'] == "integer" -%} {{- "number[]" }} {%- elif param_spec['items']['type'] == "boolean" -%} {{- "boolean[]" }} {%- else -%} {%- set inner_type = render_typescript_type(param_spec['items'], required_params) -%} {%- if inner_type == "object | object" or inner_type|length > 50 -%} {{- "any[]" }} {%- else -%} {{- inner_type + "[]" }} {%- endif -%} {%- endif -%} {%- if param_spec.nullable -%} {{- " | null" }} {%- endif -%} {%- else -%} {{- "any[]" }} {%- if param_spec.nullable -%} {{- " | null" }} {%- endif -%} {%- endif -%} {%- elif param_spec.type is defined and param_spec.type is iterable and param_spec.type is not string and param_spec.type is not mapping and param_spec.type[0] is defined -%} {#- Handle array of types like ["object", "object"] from Union[dict, list] #} {%- if param_spec.type | length > 1 -%} {{- param_spec.type | join(" | ") }} {%- else -%} {{- param_spec.type[0] }} {%- endif -%} {%- elif param_spec.oneOf -%} {#- Handle oneOf schemas - check for complex unions and fallback to any #} {%- set has_object_variants = false -%} {%- for variant in param_spec.oneOf -%} {%- if variant.type == "object" -%} {%- set has_object_variants = true -%} {%- endif -%} {%- endfor -%} {%- if has_object_variants and param_spec.oneOf|length > 1 -%} {{- "any" }} {%- else -%} {%- for variant in param_spec.oneOf -%} {{- render_typescript_type(variant, required_params) -}} {%- if variant.description %} {{- "// " + variant.description }} {%- endif -%} {%- if variant.default is defined %} {{ "// default: " + variant.default|tojson }} {%- endif -%} {%- if not loop.last %} {{- " | " }} {% endif -%} {%- endfor -%} {%- endif -%} {%- elif param_spec.type == "string" -%} {%- if param_spec.enum -%} {{- '"' + param_spec.enum|join('" | "') + '"' -}} {%- else -%} {{- "string" }} {%- if param_spec.nullable %} {{- " | null" }} {%- endif -%} {%- endif -%} {%- elif param_spec.type == "number" -%} {{- "number" }} {%- elif param_spec.type == "integer" -%} {{- "number" }} {%- elif param_spec.type == "boolean" -%} {{- "boolean" }}
{%- elif param_spec.type == "object" -%}
{%- if param_spec.properties -%}
{{- "{
" }} {%- for prop_name, prop_spec in param_spec.properties.items() -%} {{- prop_name -}} {%- if prop_name not in (param_spec.required or []) -%} {{- "?" }} {%- endif -%} {{- ": " }} {{ render_typescript_type(prop_spec, param_spec.required or []) }} {%- if not loop.last -%} {{-", " }} {%- endif -%} {%- endfor -%} {{- "}" }} {%- else -%} {{- "object" }} {%- endif -%} {%- else -%} {{- "any" }} {%- endif -%} {%- endmacro -%}
{%- macro render_tool_namespace(namespace_name, tools) -%} {{- "## " + namespace_name + "
" }} {{- "namespace " + namespace_name + " {
" }} {%- for tool in tools %} {%- set tool = tool.function %} {{- "// " + tool.description + " " }} {{- "type "+ tool.name + " = " }} {%- if tool.parameters and tool.parameters.properties %} {{- "(_: { " }} {%- for param_name, param_spec in tool.parameters.properties.items() %} {%- if param_spec.description %} {{- "// " + param_spec.description + " " }} {%- endif %} {{- param_name }} {%- if param_name not in (tool.parameters.required or []) -%} {{- "?" }} {%- endif -%} {{- ": " }} {{- render_typescript_type(param_spec, tool.parameters.required or []) }} {%- if param_spec.default is defined -%} {%- if param_spec.enum %} {{- ", // default: " + param_spec.default }} {%- elif param_spec.oneOf %} {{- "// default: " + param_spec.default }} {%- else %} {{- ", // default: " + param_spec.default|tojson }} {%- endif -%} {%- endif -%} {%- if not loop.last %} {{- ", " }} {%- else %} {{- " " }} {%- endif -%} {%- endfor %} {{- "}) => any;
" }} {%- else -%} {{- "() => any;
" }} {%- endif -%} {%- endfor %} {{- "} // namespace " + namespace_name }} {%- endmacro -%}
{%- macro render_builtin_tools(browser_tool, python_tool) -%} {%- if browser_tool %} {{- "## browser
" }}
{{- "// Tool for browsing.
" }}
{{- "// The cursor appears in brackets before each browsing display: [{cursor}].
" }}
{{- "// Cite information from the tool using the following format:
" }}
{{- "// 【{cursor}†L{line_start}(-L{line_end})?】, for example: 【6†L9-L11】 or 【8†L3】.
" }}
{{- "// Do not quote more than 10 words directly from the tool output.
" }}
{{- "// sources=web (default: web)
" }}
{{- "namespace browser {
" }}
{{- "// Searches for information related to query and displays topn results.
" }}
{{- "type search = (_: {
" }}
{{- "query: string,
" }}
{{- "topn?: number, // default: 10
" }}
{{- "source?: string,
" }}
{{- "}) => any;
" }}
{{- "// Opens the link id from the page indicated by cursor starting at line number loc, showing num_lines lines.
" }}
{{- "// Valid link ids are displayed with the formatting: 【{id}†.*】.
" }}
{{- "// If cursor is not provided, the most recent page is implied.
" }}
{{- "// If id is a string, it is treated as a fully qualified URL associated with source.
" }}
{{- "// If loc is not provided, the viewport will be positioned at the beginning of the document or centered on the most relevant passage, if available.
" }}
{{- "// Use this function without id to scroll to a new location of an opened page.
" }}
{{- "type open = (_: {
" }}
{{- "id?: number | string, // default: -1
" }}
{{- "cursor?: number, // default: -1
" }}
{{- "loc?: number, // default: -1
" }}
{{- "num_lines?: number, // default: -1
" }}
{{- "view_source?: boolean, // default: false
" }}
{{- "source?: string,
" }}
{{- "}) => any;
" }}
{{- "// Finds exact matches of pattern in the current page, or the page given by cursor.
" }}
{{- "type find = (_: {
" }}
{{- "pattern: string,
" }}
{{- "cursor?: number, // default: -1
" }}
{{- "}) => any;
" }} {{- "} // namespace browser
" }} {%- endif -%}
{%- if python_tool %}
{{- "## python
" }} {{- "Use this tool to execute Python code in your chain of thought. The code will not be shown to the user. This tool should be used for internal reasoning, but not for code that is intended to be visible to the user (e.g. when creating plots, tables, or files).
" }} {{- "When you send a message containing Python code to python, it will be executed in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 120.0 seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is UNKNOWN. Depends on the cluster.
" }} {%- endif -%} {%- endmacro -%}
{#- System Message Construction ============================================ #} {%- macro build_system_message() -%} {%- if model_identity is not defined %} {%- set model_identity = "You are ChatGPT, a large language model trained by OpenAI." %} {%- endif %} {{- model_identity + " " }} {{- "Knowledge cutoff: 2024-06 " }} {{- "Current date: " + strftime_now("%Y-%m-%d") + "
" }} {%- if reasoning_effort is not defined %} {%- set reasoning_effort = "medium" %} {%- endif %} {{- "Reasoning: " + reasoning_effort + "
" }} {%- if builtin_tools %} {{- "# Tools
" }} {%- set available_builtin_tools = namespace(browser=false, python=false) %} {%- for tool in builtin_tools %} {%- if tool == "browser" %} {%- set available_builtin_tools.browser = true %} {%- elif tool == "python" %} {%- set available_builtin_tools.python = true %} {%- endif %} {%- endfor %} {{- render_builtin_tools(available_builtin_tools.browser, available_builtin_tools.python) }} {%- endif -%} {{- "# Valid channels: analysis, commentary, final. Channel must be included for every message." }} {%- if tools -%} {{- " Calls to these tools must go to the commentary channel: 'functions'." }} {%- endif -%} {%- endmacro -%}
{#- Main Template Logic ================================================= #} {#- Set defaults #}
{#- Render system message #} {{- "<|start|>system<|message|>" }} {{- build_system_message() }} {{- "<|end|>" }}
{#- Extract developer message #} {%- if messages[0].role == "developer" or messages[0].role == "system" %} {%- set developer_message = messages[0].content %} {%- set loop_messages = messages[1:] %} {%- else %} {%- set developer_message = "" %} {%- set loop_messages = messages %} {%- endif %}
{#- Render developer message #} {%- if developer_message or tools %} {{- "<|start|>developer<|message|>" }} {%- if developer_message %} {{- "# Instructions
" }} {{- developer_message }} {%- endif %} {%- if tools -%} {{- "
" }} {{- "# Tools
" }} {{- render_tool_namespace("functions", tools) }} {%- endif -%} {{- "<|end|>" }} {%- endif %}
{#- Render messages #} {%- set last_tool_call = namespace(name=none) %} {%- for message in loop_messages -%} {#- At this point only assistant/user/tool messages should remain #} {%- if message.role == 'assistant' -%} {#- Checks to ensure the messages are being passed in the format we expect #} {%- if "content" in message %} {%- if false %} {{- raise_exception("You have passed a message containing <|channel|> tags in the content field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }} {%- endif %} {%- endif %} {%- if "thinking" in message %} {%- if "<|channel|>analysis<|message|>" in message.thinking or "<|channel|>final<|message|>" in message.thinking %} {{- raise_exception("You have passed a message containing <|channel|> tags in the thinking field. Instead of doing this, you should pass analysis messages (the string between '<|message|>' and '<|end|>') in the 'thinking' field, and final messages (the string between '<|message|>' and '<|end|>') in the 'content' field.") }} {%- endif %} {%- endif %} {%- if "tool_calls" in message %} {#- We assume max 1 tool call per message, and so we infer the tool call name #} {#- in "tool" messages from the most recent assistant tool call name #} {%- set tool_call = message.tool_calls[0] %} {%- if tool_call.function %} {%- set tool_call = tool_call.function %} {%- endif %} {%- if message.content and message.thinking %} {{- raise_exception("Cannot pass both content and thinking in an assistant message with tool calls! Put the analysis message in one or the other, but not both.") }} {%- elif message.content %} {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.content + "<|end|>" }} {%- elif message.thinking %} {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }} {%- endif %} {{- "<|start|>assistant to=" }} {{- "functions." + tool_call.name + "<|channel|>commentary " }} {{- (tool_call.content_type if tool_call.content_type is defined else "json") + "<|message|>" }} {{- tool_call.arguments|tojson }} {{- "<|call|>" }} {%- set last_tool_call.name = tool_call.name %} {%- elif loop.last and not add_generation_prompt %} {#- Only render the CoT if the final turn is an assistant turn and add_generation_prompt is false #} {#- This is a situation that should only occur in training, never in inference. #} {%- if "thinking" in message %} {{- "<|start|>assistant<|channel|>analysis<|message|>" + message.thinking + "<|end|>" }} {%- endif %} {#- <|return|> indicates the end of generation, but <|end|> does not #} {#- <|return|> should never be an input to the model, but we include it as the final token #} {#- when training, so the model learns to emit it. #} {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|return|>" }} {%- else %} {#- CoT is dropped during all previous turns, so we never render it for inference #} {{- "<|start|>assistant<|channel|>final<|message|>" + message.content + "<|end|>" }} {%- set last_tool_call.name = none %} {%- endif %} {%- elif message.role == 'tool' -%} {%- if last_tool_call.name is none %} {{- raise_exception("Message has tool role, but there was no previous assistant message with a tool call!") }} {%- endif %} {{- "<|start|>functions." 
+ last_tool_call.name }} {{- " to=assistant<|channel|>commentary<|message|>" + message.content|tojson + "<|end|>" }} {%- elif message.role == 'user' -%} {{- "<|start|>user<|message|>" + message.content + "<|end|>" }} {%- endif -%} {%- endfor -%}
{#- Generation prompt #} {%- if add_generation_prompt -%} <|start|>assistant {%- endif -%}, example_format: '<|start|>system<|message|>You are a helpful assistant<|end|><|start|>user<|message|>Hello<|end|><|start|>assistant<|message|>Hi there<|return|><|start|>user<|message|>How are you?<|end|><|start|>assistant'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: 400
srv params_from_: Chat format: Content-only
slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 3 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 3 | task 0 | processing task
slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 50176, n_keep = 0, task.n_tokens = 9173
slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 2048, batch.n_tokens = 2048, progress = 0.223264
srv log_server_r: request: GET /v1/models 127.0.0.1 200
srv log_server_r: request: GET / 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
slot update_slots: id 3 | task 0 | n_tokens = 2048, memory_seq_rm [2048, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 4096, batch.n_tokens = 2048, progress = 0.446528
srv log_server_r: request: GET /slots 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv log_server_r: request: GET /props 127.0.0.1 200
srv params_from_: Chat format: Content-only
slot get_availabl: id 2 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 4 | processing task
slot update_slots: id 2 | task 4 | new prompt, n_ctx_slot = 50176, n_keep = 0, task.n_tokens = 10
slot update_slots: id 2 | task 4 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id 2 | task 4 | prompt processing progress, n_tokens = 10, batch.n_tokens = 10, progress = 1.000000
slot update_slots: id 2 | task 4 | prompt done, n_tokens = 10, batch.n_tokens = 10
slot update_slots: id 3 | task 0 | n_tokens = 4096, memory_seq_rm [4096, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 6134, batch.n_tokens = 2048, progress = 0.668702
slot update_slots: id 3 | task 0 | n_tokens = 6134, memory_seq_rm [6134, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 8181, batch.n_tokens = 2048, progress = 0.891857
srv log_server_r: request: GET /props 127.0.0.1 200
slot update_slots: id 3 | task 0 | n_tokens = 8181, memory_seq_rm [8181, end)
srv log_server_r: request: GET /slots 127.0.0.1 200
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 9109, batch.n_tokens = 929, progress = 0.993023
slot update_slots: id 3 | task 0 | n_tokens = 9109, memory_seq_rm [9109, end)
slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 9173, batch.n_tokens = 65, progress = 1.000000
srv log_server_r: request: GET /slots 127.0.0.1 200
slot update_slots: id 3 | task 0 | prompt done, n_tokens = 9173, batch.n_tokens = 65
slot update_slots: id 3 | task 0 | created context checkpoint 1 of 8 (pos_min = 8097, pos_max = 9108, size = 35.590 MiB)
srv log_server_r: request: GET /slots 127.0.0.1 200
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv stop: cancel task, id_task = 4
slot release: id 2 | task 4 | stop processing: n_tokens = 59, truncated = 0
VULKAN TEXT:
Tel me a story
. However til: 6:0. :. Let's we:
Let's
Here:
Consider. The
Let:
2:
Consider
Here:
If we
We
Statistics: 2.22 tokens/s 46 tokens