Eval bug: Gemma 3 extremely slow prompt processing when using quantized KV cache.
Name and Version
./build/bin/llama-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
version: 0 (unknown)
built with cc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2) for x86_64-redhat-linux
(latest release, b4876)
Operating systems
Linux
GGML backends
CUDA
Hardware
AMD Ryzen 9 3900X + NVIDIA GeForce RTX 3060 12 GB
Models
Gemma-3-12b_Q5_K_M
Problem description & steps to reproduce
Prompt eval time is way slower with a quantized KV cache than with the default f16 KV cache. I also see CPU usage go up when the quantized KV cache is enabled, so I believe the KV cache is not being processed by the GPU when a quantized KV cache is requested.
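For reference, the only difference between the fast and the slow run is the KV cache type flags. A condensed version of the two commands from the logs below (model path shortened):

# fast: default f16 KV cache (~930 t/s prompt eval)
./build/bin/llama-server -m gemma-3-12b-it-Q5_K_M.gguf -ngl 99 --flash-attn -c 4000 --batch_size 1024

# slow: q8_0 KV cache (~110 t/s prompt eval, CPU load goes up)
./build/bin/llama-server -m gemma-3-12b-it-Q5_K_M.gguf -ngl 99 --flash-attn -c 4000 --batch_size 1024 --cache-type-k q8_0 --cache-type-v q8_0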
First Bad Commit
No response
Relevant log output
# Unquantized kv cache:
./build/bin/llama-server -m '/home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf' --n-gpu-layers -1 --batch_size 1024 --flash-attn -c 4000 --port 7777 -t 8 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 0 (unknown) with cc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 24
system_info: n_threads = 8 (n_threads_batch = 8) / 24 | CUDA : ARCHS = 860 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 7777, http threads: 23
main: loading model
srv load_model: loading model '/home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) - 10456 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 626 tensors from /home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 3
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 12B
llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 6: gemma3.context_length u32 = 131072
llama_model_loader: - kv 7: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 8: gemma3.block_count u32 = 48
llama_model_loader: - kv 9: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 10: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 11: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 13: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 14: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 16: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 18: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 26: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 31: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 17
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q5_K: 288 tensors
llama_model_loader: - type q6_K: 49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q5_K - Medium
print_info: file size = 7.86 GiB (5.74 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = Gemma 3
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 787.69 MiB
load_tensors: CUDA0 model buffer size = 8047.63 MiB
.....................................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 0.125
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 48, can_shift = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1536.00 MiB
llama_init_from_model: KV self size = 1536.00 MiB, K (f16): 768.00 MiB, V (f16): 768.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 1.00 MiB
llama_init_from_model: CUDA0 compute buffer size = 519.62 MiB
llama_init_from_model: CUDA_Host compute buffer size = 23.51 MiB
llama_init_from_model: graph nodes = 1737
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
{%- if messages[0]['content'] is string -%}
{%- set first_user_prefix = messages[0]['content'] + '
' -%}
{%- else -%}
{%- set first_user_prefix = messages[0]['content'][0]['text'] + '
' -%}
{%- endif -%}
{%- set loop_messages = messages[1:] -%}
{%- else -%}
{%- set first_user_prefix = "" -%}
{%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif -%}
{%- if (message['role'] == 'assistant') -%}
{%- set role = "model" -%}
{%- else -%}
{%- set role = message['role'] -%}
{%- endif -%}
{{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
{%- if message['content'] is string -%}
{{ message['content'] | trim }}
{%- elif message['content'] is iterable -%}
{%- for item in message['content'] -%}
{%- if item['type'] == 'image' -%}
{{ '<start_of_image>' }}
{%- elif item['type'] == 'text' -%}
{{ item['text'] | trim }}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{ raise_exception("Invalid content type") }}
{%- endif -%}
{{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:7777 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 2505
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 1024, n_tokens = 1024, progress = 0.408782
slot update_slots: id 0 | task 0 | kv cache rm [1024, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 1024, progress = 0.817565
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2505, n_tokens = 457, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 2505, n_tokens = 457
slot release: id 0 | task 0 | stop processing: n_past = 3199, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 2697.39 ms / 2505 tokens ( 1.08 ms per token, 928.67 tokens per second)
eval time = 24911.73 ms / 695 tokens ( 35.84 ms per token, 27.90 tokens per second)
total time = 27609.12 ms / 3200 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
# Quantized kv cache:
./build/bin/llama-server -m '/home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf' --n-gpu-layers -1 --cache-type-k q8_0 --cache-type-v q8_0 --batch_size 1024 --flash-attn -c 4000 --port 7777 -t 8 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 0 (unknown) with cc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 24
system_info: n_threads = 8 (n_threads_batch = 8) / 24 | CUDA : ARCHS = 860 | FORCE_CUBLAS = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 7777, http threads: 23
main: loading model
srv load_model: loading model '/home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) - 10500 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 626 tensors from /home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 3
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 12B
llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 6: gemma3.context_length u32 = 131072
llama_model_loader: - kv 7: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 8: gemma3.block_count u32 = 48
llama_model_loader: - kv 9: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 10: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 11: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 13: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 14: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 16: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 18: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,262208] = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv 22: tokenizer.ggml.scores arr[f32,262208] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,262208] = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 2
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 106
llama_model_loader: - kv 26: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv 31: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 17
llama_model_loader: - type f32: 289 tensors
llama_model_loader: - type q5_K: 288 tensors
llama_model_loader: - type q6_K: 49 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q5_K - Medium
print_info: file size = 7.86 GiB (5.74 BPW)
load: special tokens cache size = 6415
load: token to piece cache size = 1.9446 MB
print_info: arch = gemma3
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 3840
print_info: n_layer = 48
print_info: n_head = 16
print_info: n_head_kv = 8
print_info: n_rot = 256
print_info: n_swa = 1024
print_info: n_embd_head_k = 256
print_info: n_embd_head_v = 256
print_info: n_gqa = 2
print_info: n_embd_k_gqa = 2048
print_info: n_embd_v_gqa = 2048
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 6.2e-02
print_info: n_ff = 15360
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 0.125
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 12B
print_info: model params = 11.77 B
print_info: general.name = Gemma 3
print_info: vocab type = SPM
print_info: n_vocab = 262208
print_info: n_merges = 0
print_info: BOS token = 2 '<bos>'
print_info: EOS token = 106 '<end_of_turn>'
print_info: EOT token = 106 '<end_of_turn>'
print_info: UNK token = 3 '<unk>'
print_info: PAD token = 0 '<pad>'
print_info: LF token = 248 '<0x0A>'
print_info: EOG token = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 48 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CPU_Mapped model buffer size = 787.69 MiB
load_tensors: CUDA0 model buffer size = 8047.63 MiB
.....................................................................................
llama_init_from_model: n_seq_max = 1
llama_init_from_model: n_ctx = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch = 1024
llama_init_from_model: n_ubatch = 512
llama_init_from_model: flash_attn = 1
llama_init_from_model: freq_base = 1000000.0
llama_init_from_model: freq_scale = 0.125
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'q8_0', type_v = 'q8_0', n_layer = 48, can_shift = 1
llama_kv_cache_init: CUDA0 KV buffer size = 816.00 MiB
llama_init_from_model: KV self size = 816.00 MiB, K (q8_0): 408.00 MiB, V (q8_0): 408.00 MiB
llama_init_from_model: CUDA_Host output buffer size = 1.00 MiB
llama_init_from_model: CUDA0 compute buffer size = 519.62 MiB
llama_init_from_model: CUDA_Host compute buffer size = 45.01 MiB
llama_init_from_model: graph nodes = 1737
llama_init_from_model: graph splits = 98
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {{ bos_token }}
{%- if messages[0]['role'] == 'system' -%}
{%- if messages[0]['content'] is string -%}
{%- set first_user_prefix = messages[0]['content'] + '
' -%}
{%- else -%}
{%- set first_user_prefix = messages[0]['content'][0]['text'] + '
' -%}
{%- endif -%}
{%- set loop_messages = messages[1:] -%}
{%- else -%}
{%- set first_user_prefix = "" -%}
{%- set loop_messages = messages -%}
{%- endif -%}
{%- for message in loop_messages -%}
{%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%}
{{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }}
{%- endif -%}
{%- if (message['role'] == 'assistant') -%}
{%- set role = "model" -%}
{%- else -%}
{%- set role = message['role'] -%}
{%- endif -%}
{{ '<start_of_turn>' + role + '
' + (first_user_prefix if loop.first else "") }}
{%- if message['content'] is string -%}
{{ message['content'] | trim }}
{%- elif message['content'] is iterable -%}
{%- for item in message['content'] -%}
{%- if item['type'] == 'image' -%}
{{ '<start_of_image>' }}
{%- elif item['type'] == 'text' -%}
{{ item['text'] | trim }}
{%- endif -%}
{%- endfor -%}
{%- else -%}
{{ raise_exception("Invalid content type") }}
{%- endif -%}
{{ '<end_of_turn>
' }}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{'<start_of_turn>model
'}}
{%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:7777 - starting the main loop
srv update_slots: all slots are idle
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 2505
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 1024, n_tokens = 1024, progress = 0.408782
slot update_slots: id 0 | task 0 | kv cache rm [1024, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 1024, progress = 0.817565
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2505, n_tokens = 457, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 2505, n_tokens = 457
slot release: id 0 | task 0 | stop processing: n_past = 3209, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 23150.35 ms / 2505 tokens ( 9.24 ms per token, 108.21 tokens per second)
eval time = 78076.05 ms / 705 tokens ( 110.75 ms per token, 9.03 tokens per second)
total time = 101226.41 ms / 3210 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
The same bug also happens with RekaAI/reka-flash-3.
Also encountered the same problem.
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: yes
Try a build without this option. If you don't remember enabling it, delete the build directory first and reconfigure cmake.
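A minimal sketch of a clean rebuild, assuming the usual llama.cpp CMake workflow (adjust the flags to your setup):

# remove the old configuration so the cached GGML_CUDA_FORCE_CUBLAS setting is gone
rm -rf build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j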
Unfortunately that did not change a thing:
./bin/llama-server -m '/home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf' --n-gpu-layers -1 --cache-type-k q8_0 --cache-type-v q8_0 --batch_size 1024 --flash-attn -c 4000 --port 7777 -t 8 -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
build: 0 (unknown) with cc (GCC) 13.3.1 20240611 (Red Hat 13.3.1-2) for x86_64-redhat-linux
system info: n_threads = 8, n_threads_batch = 8, total_threads = 24
system_info: n_threads = 8 (n_threads_batch = 8) / 24 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: HTTP server is listening, hostname: 127.0.0.1, port: 7777, http threads: 23
main: loading model
srv load_model: loading model '/home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf'
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3060) - 10444 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 626 tensors from /home/luis/Downloads/llama.cpp-b4876/models/gemma-3-12b-it-Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gemma3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Gemma 3
llama_model_loader: - kv 3: general.quantized_by str = Unsloth
llama_model_loader: - kv 4: general.size_label str = 12B
llama_model_loader: - kv 5: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 6: gemma3.context_length u32 = 131072
llama_model_loader: - kv 7: gemma3.embedding_length u32 = 3840
llama_model_loader: - kv 8: gemma3.block_count u32 = 48
llama_model_loader: - kv 9: gemma3.feed_forward_length u32 = 15360
llama_model_loader: - kv 10: gemma3.attention.head_count u32 = 16
llama_model_loader: - kv 11: gemma3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 12: gemma3.attention.key_length u32 = 256
llama_model_loader: - kv 13: gemma3.attention.value_length u32 = 256
llama_model_loader: - kv 14: gemma3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: gemma3.attention.sliding_window u32 = 1024
llama_model_loader: - kv 16: gemma3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 17: gemma3.rope.scaling.type str = linear
llama_model_loader: - kv 18: gemma3.rope.scaling.factor f32 = 8.000000
llama_model_loader: - kv 19: tokenizer.ggml.model str = llama
llama_model_loader: - kv 20: tokenizer.ggml.pre str = default
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,262208] = ["
' -%} {%- else -%} {%- set first_user_prefix = messages[0]['content'][0]['text'] + '
' -%} {%- endif -%} {%- set loop_messages = messages[1:] -%} {%- else -%} {%- set first_user_prefix = "" -%} {%- set loop_messages = messages -%} {%- endif -%} {%- for message in loop_messages -%} {%- if (message['role'] == 'user') != (loop.index0 % 2 == 0) -%} {{ raise_exception("Conversation roles must alternate user/assistant/user/assistant/...") }} {%- endif -%} {%- if (message['role'] == 'assistant') -%} {%- set role = "model" -%} {%- else -%} {%- set role = message['role'] -%} {%- endif -%} {{ '<start_of_turn>' + role + ' ' + (first_user_prefix if loop.first else "") }} {%- if message['content'] is string -%} {{ message['content'] | trim }} {%- elif message['content'] is iterable -%} {%- for item in message['content'] -%} {%- if item['type'] == 'image' -%} {{ '<start_of_image>' }} {%- elif item['type'] == 'text' -%} {{ item['text'] | trim }} {%- endif -%} {%- endfor -%} {%- else -%} {{ raise_exception("Invalid content type") }} {%- endif -%} {{ '<end_of_turn> ' }} {%- endfor -%} {%- if add_generation_prompt -%} {{'<start_of_turn>model '}} {%- endif -%}
, example_format: '<start_of_turn>user
You are a helpful assistant
Hello<end_of_turn>
<start_of_turn>model
Hi there<end_of_turn>
<start_of_turn>user
How are you?<end_of_turn>
<start_of_turn>model
'
main: server is listening on http://127.0.0.1:7777 - starting the main loop
srv update_slots: all slots are idle
srv log_server_r: request: GET / 127.0.0.1 200
srv log_server_r: request: GET /favicon.ico 127.0.0.1 404
srv params_from_: Chat format: Content-only
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 2505
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 1024, n_tokens = 1024, progress = 0.408782
slot update_slots: id 0 | task 0 | kv cache rm [1024, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2048, n_tokens = 1024, progress = 0.817565
slot update_slots: id 0 | task 0 | kv cache rm [2048, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 2505, n_tokens = 457, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 2505, n_tokens = 457
slot release: id 0 | task 0 | stop processing: n_past = 3225, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 22698.14 ms / 2505 tokens ( 9.06 ms per token, 110.36 tokens per second)
eval time = 78316.51 ms / 721 tokens ( 108.62 ms per token, 9.21 tokens per second)
total time = 101014.65 ms / 3226 tokens
srv update_slots: all slots are idle
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
llama_init_from_model: graph splits = 98
The large number of graph splits indicates that there is some operation that is not supported by the CUDA backend, and is being run on the CPU. If you set the environment variable GGML_SCHED_DEBUG=2 and run with -v, you should get a report that will show which operations are being run on the CPU.
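For example, something along these lines (a sketch, assuming a bash-like shell; the grep is only there to surface the split headers and the nodes assigned to the CPU):

GGML_SCHED_DEBUG=2 ./build/bin/llama-server -m models/gemma-3-12b-it-Q5_K_M.gguf -ngl 99 --flash-attn -c 4000 --cache-type-k q8_0 --cache-type-v q8_0 -v 2>&1 | grep -E 'SPLIT #|\[ CPU \]'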
This is the log with the environment variable set and -v:
SPLIT #0: CPU # 0 inputs
node # 0 ( GET_ROWS): inp_embd ( 15K) [ CPU ]: token_embd.weight ( 787M) [ CPU ] inp_tokens ( 0K) [ CPU ]
SPLIT #1: CUDA0 # 3 inputs: [inp_embd ( 15K)] [inp_pos ( 0K)] [KQ_mask_swa ( 64K)]
node # 1 ( SCALE): inp_scaled ( 15K) [CUDA0 ]: CUDA0#inp_embd#0 ( 15K) [ NULL ] node # 2 ( RMS_NORM): norm-0 ( 15K) [CUDA0 ]: inp_scaled ( 15K) [CUDA0 ] node # 3 ( MUL): attn_norm-0 ( 15K) [CUDA0 ]: norm-0 ( 15K) [CUDA0 ] blk.0.attn_norm.weig ( 15K) [CUDA0 ] node # 4 ( MUL_MAT): Qcur-0 ( 16K) [CUDA0 ]: blk.0.attn_q.weight ( 8M) [CUDA0 ] attn_norm-0 ( 15K) [CUDA0 ] node # 6 ( RMS_NORM): norm-0 ( 16K) [CUDA0 ]: Qcur-0 (reshaped) ( 16K) [CUDA0 ] node # 7 ( MUL): Qcur_normed-0 ( 16K) [CUDA0 ]: norm-0 ( 16K) [CUDA0 ] blk.0.attn_q_norm.we ( 1K) [CUDA0 ] node # 8 ( ROPE): Qcur-0 ( 16K) [CUDA0 ]: Qcur_normed-0 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node # 9 ( MUL_MAT): Kcur-0 ( 8K) [CUDA0 ]: blk.0.attn_k.weight ( 4M) [CUDA0 ] attn_norm-0 ( 15K) [CUDA0 ] node # 11 ( RMS_NORM): norm-0 ( 8K) [CUDA0 ]: Kcur-0 (reshaped) ( 8K) [CUDA0 ] node # 12 ( MUL): Kcur_normed-0 ( 8K) [CUDA0 ]: norm-0 ( 8K) [CUDA0 ] blk.0.attn_k_norm.we ( 1K) [CUDA0 ] node # 13 ( ROPE): Kcur-0 ( 8K) [CUDA0 ]: Kcur_normed-0 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node # 14 ( MUL_MAT): Vcur-0 ( 8K) [CUDA0 ]: blk.0.attn_v.weight ( 6M) [CUDA0 ] attn_norm-0 ( 15K) [CUDA0 ] node # 16 ( CPY): k_cache_view-0 (copy ( 2K) [CUDA0 ]: Kcur-0 ( 8K) [CUDA0 ] k_cache_view-0 ( 2K) [CUDA0 ] node # 18 ( CPY): v_cache_view-0 (copy ( 2K) [CUDA0 ]: Vcur-0 ( 8K) [CUDA0 ] v_cache_view-0 ( 2K) [CUDA0 ] node # 22 ( CPY): KQ_mask_swa (copy) ( 32K) [CUDA0 ]: CUDA0#KQ_mask_swa#0 ( 64K) [ NULL ] KQ_mask_swa (copy) ( 32K) [CUDA0 ]
SPLIT #2: CPU # 4 inputs: [q-0 ( 16K)] [k-0 ( 544K)] [v-0 ( 544K)] [KQ_mask_swa (copy) ( 32K)]
node # 23 (FLASH_ATTN): node_23 ( 16K) [ CPU ]: CPU#q-0#0 ( 16K) [ NULL ] CPU#k-0#0 ( 544K) [ NULL ] CPU#v-0#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #3: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node # 25 ( MUL_MAT): kqv_out-0 ( 15K) [CUDA0 ]: blk.0.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node # 26 ( RMS_NORM): norm-0 ( 15K) [CUDA0 ]: kqv_out-0 ( 15K) [CUDA0 ] node # 27 ( MUL): attn_post_norm-0 ( 15K) [CUDA0 ]: norm-0 ( 15K) [CUDA0 ] blk.0.post_attention ( 15K) [CUDA0 ] node # 28 ( ADD): sa_out-0 ( 15K) [CUDA0 ]: attn_post_norm-0 ( 15K) [CUDA0 ] inp_scaled ( 15K) [CUDA0 ] node # 29 ( RMS_NORM): norm-0 ( 15K) [CUDA0 ]: sa_out-0 ( 15K) [CUDA0 ] node # 30 ( MUL): ffn_norm-0 ( 15K) [CUDA0 ]: norm-0 ( 15K) [CUDA0 ] blk.0.ffn_norm.weigh ( 15K) [CUDA0 ] node # 31 ( MUL_MAT): ffn_gate-0 ( 60K) [CUDA0 ]: blk.0.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-0 ( 15K) [CUDA0 ] node # 32 ( UNARY): ffn_gelu-0 ( 60K) [CUDA0 ]: ffn_gate-0 ( 60K) [CUDA0 ] node # 33 ( MUL_MAT): ffn_up-0 ( 60K) [CUDA0 ]: blk.0.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-0 ( 15K) [CUDA0 ] node # 34 ( MUL): ffn_gate_par-0 ( 60K) [CUDA0 ]: ffn_gelu-0 ( 60K) [CUDA0 ] ffn_up-0 ( 60K) [CUDA0 ] node # 35 ( MUL_MAT): ffn_out-0 ( 15K) [CUDA0 ]: blk.0.ffn_down.weigh ( 46M) [CUDA0 ] ffn_gate_par-0 ( 60K) [CUDA0 ] node # 36 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-0 ( 15K) [CUDA0 ] node # 37 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.0.post_ffw_norm. ( 15K) [CUDA0 ] node # 38 ( ADD): l_out-0 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-0 ( 15K) [CUDA0 ] node # 39 ( RMS_NORM): norm-1 ( 15K) [CUDA0 ]: l_out-0 ( 15K) [CUDA0 ] node # 40 ( MUL): attn_norm-1 ( 15K) [CUDA0 ]: norm-1 ( 15K) [CUDA0 ] blk.1.attn_norm.weig ( 15K) [CUDA0 ] node # 41 ( MUL_MAT): Qcur-1 ( 16K) [CUDA0 ]: blk.1.attn_q.weight ( 8M) [CUDA0 ] attn_norm-1 ( 15K) [CUDA0 ] node # 43 ( RMS_NORM): norm-1 ( 16K) [CUDA0 ]: Qcur-1 (reshaped) ( 16K) [CUDA0 ] node # 44 ( MUL): Qcur_normed-1 ( 16K) [CUDA0 ]: norm-1 ( 16K) [CUDA0 ] blk.1.attn_q_norm.we ( 1K) [CUDA0 ] node # 45 ( ROPE): Qcur-1 ( 16K) [CUDA0 ]: Qcur_normed-1 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node # 46 ( MUL_MAT): Kcur-1 ( 8K) [CUDA0 ]: blk.1.attn_k.weight ( 4M) [CUDA0 ] attn_norm-1 ( 15K) [CUDA0 ] node # 48 ( RMS_NORM): norm-1 ( 8K) [CUDA0 ]: Kcur-1 (reshaped) ( 8K) [CUDA0 ] node # 49 ( MUL): Kcur_normed-1 ( 8K) [CUDA0 ]: norm-1 ( 8K) [CUDA0 ] blk.1.attn_k_norm.we ( 1K) [CUDA0 ] node # 50 ( ROPE): Kcur-1 ( 8K) [CUDA0 ]: Kcur_normed-1 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node # 51 ( MUL_MAT): Vcur-1 ( 8K) [CUDA0 ]: blk.1.attn_v.weight ( 6M) [CUDA0 ] attn_norm-1 ( 15K) [CUDA0 ] node # 53 ( CPY): k_cache_view-1 (copy ( 2K) [CUDA0 ]: Kcur-1 ( 8K) [CUDA0 ] k_cache_view-1 ( 2K) [CUDA0 ] node # 55 ( CPY): v_cache_view-1 (copy ( 2K) [CUDA0 ]: Vcur-1 ( 8K) [CUDA0 ] v_cache_view-1 ( 2K) [CUDA0 ]
SPLIT #4: CPU # 3 inputs: [q-1 ( 16K)] [k-1 ( 544K)] [v-1 ( 544K)]
node # 59 (FLASH_ATTN): node_59 ( 16K) [ CPU ]: CPU#q-1#0 ( 16K) [ NULL ] CPU#k-1#0 ( 544K) [ NULL ] CPU#v-1#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #5: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node # 61 ( MUL_MAT): kqv_out-1 ( 15K) [CUDA0 ]: blk.1.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node # 62 ( RMS_NORM): norm-1 ( 15K) [CUDA0 ]: kqv_out-1 ( 15K) [CUDA0 ] node # 63 ( MUL): attn_post_norm-1 ( 15K) [CUDA0 ]: norm-1 ( 15K) [CUDA0 ] blk.1.post_attention ( 15K) [CUDA0 ] node # 64 ( ADD): sa_out-1 ( 15K) [CUDA0 ]: attn_post_norm-1 ( 15K) [CUDA0 ] l_out-0 ( 15K) [CUDA0 ] node # 65 ( RMS_NORM): norm-1 ( 15K) [CUDA0 ]: sa_out-1 ( 15K) [CUDA0 ] node # 66 ( MUL): ffn_norm-1 ( 15K) [CUDA0 ]: norm-1 ( 15K) [CUDA0 ] blk.1.ffn_norm.weigh ( 15K) [CUDA0 ] node # 67 ( MUL_MAT): ffn_gate-1 ( 60K) [CUDA0 ]: blk.1.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-1 ( 15K) [CUDA0 ] node # 68 ( UNARY): ffn_gelu-1 ( 60K) [CUDA0 ]: ffn_gate-1 ( 60K) [CUDA0 ] node # 69 ( MUL_MAT): ffn_up-1 ( 60K) [CUDA0 ]: blk.1.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-1 ( 15K) [CUDA0 ] node # 70 ( MUL): ffn_gate_par-1 ( 60K) [CUDA0 ]: ffn_gelu-1 ( 60K) [CUDA0 ] ffn_up-1 ( 60K) [CUDA0 ] node # 71 ( MUL_MAT): ffn_out-1 ( 15K) [CUDA0 ]: blk.1.ffn_down.weigh ( 46M) [CUDA0 ] ffn_gate_par-1 ( 60K) [CUDA0 ] node # 72 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-1 ( 15K) [CUDA0 ] node # 73 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.1.post_ffw_norm. ( 15K) [CUDA0 ] node # 74 ( ADD): l_out-1 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-1 ( 15K) [CUDA0 ] node # 75 ( RMS_NORM): norm-2 ( 15K) [CUDA0 ]: l_out-1 ( 15K) [CUDA0 ] node # 76 ( MUL): attn_norm-2 ( 15K) [CUDA0 ]: norm-2 ( 15K) [CUDA0 ] blk.2.attn_norm.weig ( 15K) [CUDA0 ] node # 77 ( MUL_MAT): Qcur-2 ( 16K) [CUDA0 ]: blk.2.attn_q.weight ( 8M) [CUDA0 ] attn_norm-2 ( 15K) [CUDA0 ] node # 79 ( RMS_NORM): norm-2 ( 16K) [CUDA0 ]: Qcur-2 (reshaped) ( 16K) [CUDA0 ] node # 80 ( MUL): Qcur_normed-2 ( 16K) [CUDA0 ]: norm-2 ( 16K) [CUDA0 ] blk.2.attn_q_norm.we ( 1K) [CUDA0 ] node # 81 ( ROPE): Qcur-2 ( 16K) [CUDA0 ]: Qcur_normed-2 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node # 82 ( MUL_MAT): Kcur-2 ( 8K) [CUDA0 ]: blk.2.attn_k.weight ( 4M) [CUDA0 ] attn_norm-2 ( 15K) [CUDA0 ] node # 84 ( RMS_NORM): norm-2 ( 8K) [CUDA0 ]: Kcur-2 (reshaped) ( 8K) [CUDA0 ] node # 85 ( MUL): Kcur_normed-2 ( 8K) [CUDA0 ]: norm-2 ( 8K) [CUDA0 ] blk.2.attn_k_norm.we ( 1K) [CUDA0 ] node # 86 ( ROPE): Kcur-2 ( 8K) [CUDA0 ]: Kcur_normed-2 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node # 87 ( MUL_MAT): Vcur-2 ( 8K) [CUDA0 ]: blk.2.attn_v.weight ( 6M) [CUDA0 ] attn_norm-2 ( 15K) [CUDA0 ] node # 89 ( CPY): k_cache_view-2 (copy ( 2K) [CUDA0 ]: Kcur-2 ( 8K) [CUDA0 ] k_cache_view-2 ( 2K) [CUDA0 ] node # 91 ( CPY): v_cache_view-2 (copy ( 2K) [CUDA0 ]: Vcur-2 ( 8K) [CUDA0 ] v_cache_view-2 ( 2K) [CUDA0 ]
SPLIT #6: CPU # 3 inputs: [q-2 ( 16K)] [k-2 ( 544K)] [v-2 ( 544K)]
node # 95 (FLASH_ATTN): node_95 ( 16K) [ CPU ]: CPU#q-2#0 ( 16K) [ NULL ] CPU#k-2#0 ( 544K) [ NULL ] CPU#v-2#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #7: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node # 97 ( MUL_MAT): kqv_out-2 ( 15K) [CUDA0 ]: blk.2.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node # 98 ( RMS_NORM): norm-2 ( 15K) [CUDA0 ]: kqv_out-2 ( 15K) [CUDA0 ] node # 99 ( MUL): attn_post_norm-2 ( 15K) [CUDA0 ]: norm-2 ( 15K) [CUDA0 ] blk.2.post_attention ( 15K) [CUDA0 ] node #100 ( ADD): sa_out-2 ( 15K) [CUDA0 ]: attn_post_norm-2 ( 15K) [CUDA0 ] l_out-1 ( 15K) [CUDA0 ] node #101 ( RMS_NORM): norm-2 ( 15K) [CUDA0 ]: sa_out-2 ( 15K) [CUDA0 ] node #102 ( MUL): ffn_norm-2 ( 15K) [CUDA0 ]: norm-2 ( 15K) [CUDA0 ] blk.2.ffn_norm.weigh ( 15K) [CUDA0 ] node #103 ( MUL_MAT): ffn_gate-2 ( 60K) [CUDA0 ]: blk.2.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-2 ( 15K) [CUDA0 ] node #104 ( UNARY): ffn_gelu-2 ( 60K) [CUDA0 ]: ffn_gate-2 ( 60K) [CUDA0 ] node #105 ( MUL_MAT): ffn_up-2 ( 60K) [CUDA0 ]: blk.2.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-2 ( 15K) [CUDA0 ] node #106 ( MUL): ffn_gate_par-2 ( 60K) [CUDA0 ]: ffn_gelu-2 ( 60K) [CUDA0 ] ffn_up-2 ( 60K) [CUDA0 ] node #107 ( MUL_MAT): ffn_out-2 ( 15K) [CUDA0 ]: blk.2.ffn_down.weigh ( 46M) [CUDA0 ] ffn_gate_par-2 ( 60K) [CUDA0 ] node #108 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-2 ( 15K) [CUDA0 ] node #109 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.2.post_ffw_norm. ( 15K) [CUDA0 ] node #110 ( ADD): l_out-2 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-2 ( 15K) [CUDA0 ] node #111 ( RMS_NORM): norm-3 ( 15K) [CUDA0 ]: l_out-2 ( 15K) [CUDA0 ] node #112 ( MUL): attn_norm-3 ( 15K) [CUDA0 ]: norm-3 ( 15K) [CUDA0 ] blk.3.attn_norm.weig ( 15K) [CUDA0 ] node #113 ( MUL_MAT): Qcur-3 ( 16K) [CUDA0 ]: blk.3.attn_q.weight ( 8M) [CUDA0 ] attn_norm-3 ( 15K) [CUDA0 ] node #115 ( RMS_NORM): norm-3 ( 16K) [CUDA0 ]: Qcur-3 (reshaped) ( 16K) [CUDA0 ] node #116 ( MUL): Qcur_normed-3 ( 16K) [CUDA0 ]: norm-3 ( 16K) [CUDA0 ] blk.3.attn_q_norm.we ( 1K) [CUDA0 ] node #117 ( ROPE): Qcur-3 ( 16K) [CUDA0 ]: Qcur_normed-3 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #118 ( MUL_MAT): Kcur-3 ( 8K) [CUDA0 ]: blk.3.attn_k.weight ( 4M) [CUDA0 ] attn_norm-3 ( 15K) [CUDA0 ] node #120 ( RMS_NORM): norm-3 ( 8K) [CUDA0 ]: Kcur-3 (reshaped) ( 8K) [CUDA0 ] node #121 ( MUL): Kcur_normed-3 ( 8K) [CUDA0 ]: norm-3 ( 8K) [CUDA0 ] blk.3.attn_k_norm.we ( 1K) [CUDA0 ] node #122 ( ROPE): Kcur-3 ( 8K) [CUDA0 ]: Kcur_normed-3 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #123 ( MUL_MAT): Vcur-3 ( 8K) [CUDA0 ]: blk.3.attn_v.weight ( 6M) [CUDA0 ] attn_norm-3 ( 15K) [CUDA0 ] node #125 ( CPY): k_cache_view-3 (copy ( 2K) [CUDA0 ]: Kcur-3 ( 8K) [CUDA0 ] k_cache_view-3 ( 2K) [CUDA0 ] node #127 ( CPY): v_cache_view-3 (copy ( 2K) [CUDA0 ]: Vcur-3 ( 8K) [CUDA0 ] v_cache_view-3 ( 2K) [CUDA0 ]
SPLIT #8: CPU # 3 inputs: [q-3 ( 16K)] [k-3 ( 544K)] [v-3 ( 544K)]
node #131 (FLASH_ATTN): node_131 ( 16K) [ CPU ]: CPU#q-3#0 ( 16K) [ NULL ] CPU#k-3#0 ( 544K) [ NULL ] CPU#v-3#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #9: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #133 ( MUL_MAT): kqv_out-3 ( 15K) [CUDA0 ]: blk.3.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #134 ( RMS_NORM): norm-3 ( 15K) [CUDA0 ]: kqv_out-3 ( 15K) [CUDA0 ] node #135 ( MUL): attn_post_norm-3 ( 15K) [CUDA0 ]: norm-3 ( 15K) [CUDA0 ] blk.3.post_attention ( 15K) [CUDA0 ] node #136 ( ADD): sa_out-3 ( 15K) [CUDA0 ]: attn_post_norm-3 ( 15K) [CUDA0 ] l_out-2 ( 15K) [CUDA0 ] node #137 ( RMS_NORM): norm-3 ( 15K) [CUDA0 ]: sa_out-3 ( 15K) [CUDA0 ] node #138 ( MUL): ffn_norm-3 ( 15K) [CUDA0 ]: norm-3 ( 15K) [CUDA0 ] blk.3.ffn_norm.weigh ( 15K) [CUDA0 ] node #139 ( MUL_MAT): ffn_gate-3 ( 60K) [CUDA0 ]: blk.3.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-3 ( 15K) [CUDA0 ] node #140 ( UNARY): ffn_gelu-3 ( 60K) [CUDA0 ]: ffn_gate-3 ( 60K) [CUDA0 ] node #141 ( MUL_MAT): ffn_up-3 ( 60K) [CUDA0 ]: blk.3.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-3 ( 15K) [CUDA0 ] node #142 ( MUL): ffn_gate_par-3 ( 60K) [CUDA0 ]: ffn_gelu-3 ( 60K) [CUDA0 ] ffn_up-3 ( 60K) [CUDA0 ] node #143 ( MUL_MAT): ffn_out-3 ( 15K) [CUDA0 ]: blk.3.ffn_down.weigh ( 46M) [CUDA0 ] ffn_gate_par-3 ( 60K) [CUDA0 ] node #144 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-3 ( 15K) [CUDA0 ] node #145 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.3.post_ffw_norm. ( 15K) [CUDA0 ] node #146 ( ADD): l_out-3 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-3 ( 15K) [CUDA0 ] node #147 ( RMS_NORM): norm-4 ( 15K) [CUDA0 ]: l_out-3 ( 15K) [CUDA0 ] node #148 ( MUL): attn_norm-4 ( 15K) [CUDA0 ]: norm-4 ( 15K) [CUDA0 ] blk.4.attn_norm.weig ( 15K) [CUDA0 ] node #149 ( MUL_MAT): Qcur-4 ( 16K) [CUDA0 ]: blk.4.attn_q.weight ( 8M) [CUDA0 ] attn_norm-4 ( 15K) [CUDA0 ] node #151 ( RMS_NORM): norm-4 ( 16K) [CUDA0 ]: Qcur-4 (reshaped) ( 16K) [CUDA0 ] node #152 ( MUL): Qcur_normed-4 ( 16K) [CUDA0 ]: norm-4 ( 16K) [CUDA0 ] blk.4.attn_q_norm.we ( 1K) [CUDA0 ] node #153 ( ROPE): Qcur-4 ( 16K) [CUDA0 ]: Qcur_normed-4 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #154 ( MUL_MAT): Kcur-4 ( 8K) [CUDA0 ]: blk.4.attn_k.weight ( 4M) [CUDA0 ] attn_norm-4 ( 15K) [CUDA0 ] node #156 ( RMS_NORM): norm-4 ( 8K) [CUDA0 ]: Kcur-4 (reshaped) ( 8K) [CUDA0 ] node #157 ( MUL): Kcur_normed-4 ( 8K) [CUDA0 ]: norm-4 ( 8K) [CUDA0 ] blk.4.attn_k_norm.we ( 1K) [CUDA0 ] node #158 ( ROPE): Kcur-4 ( 8K) [CUDA0 ]: Kcur_normed-4 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #159 ( MUL_MAT): Vcur-4 ( 8K) [CUDA0 ]: blk.4.attn_v.weight ( 6M) [CUDA0 ] attn_norm-4 ( 15K) [CUDA0 ] node #161 ( CPY): k_cache_view-4 (copy ( 2K) [CUDA0 ]: Kcur-4 ( 8K) [CUDA0 ] k_cache_view-4 ( 2K) [CUDA0 ] node #163 ( CPY): v_cache_view-4 (copy ( 2K) [CUDA0 ]: Vcur-4 ( 8K) [CUDA0 ] v_cache_view-4 ( 2K) [CUDA0 ]
SPLIT #10: CPU # 3 inputs: [q-4 ( 16K)] [k-4 ( 544K)] [v-4 ( 544K)]
node #167 (FLASH_ATTN): node_167 ( 16K) [ CPU ]: CPU#q-4#0 ( 16K) [ NULL ] CPU#k-4#0 ( 544K) [ NULL ] CPU#v-4#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #11: CUDA0 # 2 inputs: [ (reshaped) ( 16K)] [KQ_mask ( 64K)]
node #169 ( MUL_MAT): kqv_out-4 ( 15K) [CUDA0 ]: blk.4.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #170 ( RMS_NORM): norm-4 ( 15K) [CUDA0 ]: kqv_out-4 ( 15K) [CUDA0 ] node #171 ( MUL): attn_post_norm-4 ( 15K) [CUDA0 ]: norm-4 ( 15K) [CUDA0 ] blk.4.post_attention ( 15K) [CUDA0 ] node #172 ( ADD): sa_out-4 ( 15K) [CUDA0 ]: attn_post_norm-4 ( 15K) [CUDA0 ] l_out-3 ( 15K) [CUDA0 ] node #173 ( RMS_NORM): norm-4 ( 15K) [CUDA0 ]: sa_out-4 ( 15K) [CUDA0 ] node #174 ( MUL): ffn_norm-4 ( 15K) [CUDA0 ]: norm-4 ( 15K) [CUDA0 ] blk.4.ffn_norm.weigh ( 15K) [CUDA0 ] node #175 ( MUL_MAT): ffn_gate-4 ( 60K) [CUDA0 ]: blk.4.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-4 ( 15K) [CUDA0 ] node #176 ( UNARY): ffn_gelu-4 ( 60K) [CUDA0 ]: ffn_gate-4 ( 60K) [CUDA0 ] node #177 ( MUL_MAT): ffn_up-4 ( 60K) [CUDA0 ]: blk.4.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-4 ( 15K) [CUDA0 ] node #178 ( MUL): ffn_gate_par-4 ( 60K) [CUDA0 ]: ffn_gelu-4 ( 60K) [CUDA0 ] ffn_up-4 ( 60K) [CUDA0 ] node #179 ( MUL_MAT): ffn_out-4 ( 15K) [CUDA0 ]: blk.4.ffn_down.weigh ( 46M) [CUDA0 ] ffn_gate_par-4 ( 60K) [CUDA0 ] node #180 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-4 ( 15K) [CUDA0 ] node #181 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.4.post_ffw_norm. ( 15K) [CUDA0 ] node #182 ( ADD): l_out-4 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-4 ( 15K) [CUDA0 ] node #183 ( RMS_NORM): norm-5 ( 15K) [CUDA0 ]: l_out-4 ( 15K) [CUDA0 ] node #184 ( MUL): attn_norm-5 ( 15K) [CUDA0 ]: norm-5 ( 15K) [CUDA0 ] blk.5.attn_norm.weig ( 15K) [CUDA0 ] node #185 ( MUL_MAT): Qcur-5 ( 16K) [CUDA0 ]: blk.5.attn_q.weight ( 8M) [CUDA0 ] attn_norm-5 ( 15K) [CUDA0 ] node #187 ( RMS_NORM): norm-5 ( 16K) [CUDA0 ]: Qcur-5 (reshaped) ( 16K) [CUDA0 ] node #188 ( MUL): Qcur_normed-5 ( 16K) [CUDA0 ]: norm-5 ( 16K) [CUDA0 ] blk.5.attn_q_norm.we ( 1K) [CUDA0 ] node #189 ( ROPE): Qcur-5 ( 16K) [CUDA0 ]: Qcur_normed-5 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #190 ( MUL_MAT): Kcur-5 ( 8K) [CUDA0 ]: blk.5.attn_k.weight ( 4M) [CUDA0 ] attn_norm-5 ( 15K) [CUDA0 ] node #192 ( RMS_NORM): norm-5 ( 8K) [CUDA0 ]: Kcur-5 (reshaped) ( 8K) [CUDA0 ] node #193 ( MUL): Kcur_normed-5 ( 8K) [CUDA0 ]: norm-5 ( 8K) [CUDA0 ] blk.5.attn_k_norm.we ( 1K) [CUDA0 ] node #194 ( ROPE): Kcur-5 ( 8K) [CUDA0 ]: Kcur_normed-5 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #195 ( MUL_MAT): Vcur-5 ( 8K) [CUDA0 ]: blk.5.attn_v.weight ( 6M) [CUDA0 ] attn_norm-5 ( 15K) [CUDA0 ] node #197 ( CPY): k_cache_view-5 (copy ( 2K) [CUDA0 ]: Kcur-5 ( 8K) [CUDA0 ] k_cache_view-5 ( 2K) [CUDA0 ] node #199 ( CPY): v_cache_view-5 (copy ( 2K) [CUDA0 ]: Vcur-5 ( 8K) [CUDA0 ] v_cache_view-5 ( 2K) [CUDA0 ] node #203 ( CPY): KQ_mask (copy) ( 32K) [CUDA0 ]: CUDA0#KQ_mask#0 ( 64K) [ NULL ] KQ_mask (copy) ( 32K) [CUDA0 ]
SPLIT #12: CPU # 4 inputs: [q-5 ( 16K)] [k-5 ( 544K)] [v-5 ( 544K)] [KQ_mask (copy) ( 32K)]
node #204 (FLASH_ATTN): node_204 ( 16K) [ CPU ]: CPU#q-5#0 ( 16K) [ NULL ] CPU#k-5#0 ( 544K) [ NULL ] CPU#v-5#0 ( 544K) [ NULL ] CPU#KQ_mask (copy)#0 ( 32K) [ NULL ]
SPLIT #13: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #206 ( MUL_MAT): kqv_out-5 ( 15K) [CUDA0 ]: blk.5.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #207 ( RMS_NORM): norm-5 ( 15K) [CUDA0 ]: kqv_out-5 ( 15K) [CUDA0 ] node #208 ( MUL): attn_post_norm-5 ( 15K) [CUDA0 ]: norm-5 ( 15K) [CUDA0 ] blk.5.post_attention ( 15K) [CUDA0 ] node #209 ( ADD): sa_out-5 ( 15K) [CUDA0 ]: attn_post_norm-5 ( 15K) [CUDA0 ] l_out-4 ( 15K) [CUDA0 ] node #210 ( RMS_NORM): norm-5 ( 15K) [CUDA0 ]: sa_out-5 ( 15K) [CUDA0 ] node #211 ( MUL): ffn_norm-5 ( 15K) [CUDA0 ]: norm-5 ( 15K) [CUDA0 ] blk.5.ffn_norm.weigh ( 15K) [CUDA0 ] node #212 ( MUL_MAT): ffn_gate-5 ( 60K) [CUDA0 ]: blk.5.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-5 ( 15K) [CUDA0 ] node #213 ( UNARY): ffn_gelu-5 ( 60K) [CUDA0 ]: ffn_gate-5 ( 60K) [CUDA0 ] node #214 ( MUL_MAT): ffn_up-5 ( 60K) [CUDA0 ]: blk.5.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-5 ( 15K) [CUDA0 ] node #215 ( MUL): ffn_gate_par-5 ( 60K) [CUDA0 ]: ffn_gelu-5 ( 60K) [CUDA0 ] ffn_up-5 ( 60K) [CUDA0 ] node #216 ( MUL_MAT): ffn_out-5 ( 15K) [CUDA0 ]: blk.5.ffn_down.weigh ( 46M) [CUDA0 ] ffn_gate_par-5 ( 60K) [CUDA0 ] node #217 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-5 ( 15K) [CUDA0 ] node #218 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.5.post_ffw_norm. ( 15K) [CUDA0 ] node #219 ( ADD): l_out-5 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-5 ( 15K) [CUDA0 ] node #220 ( RMS_NORM): norm-6 ( 15K) [CUDA0 ]: l_out-5 ( 15K) [CUDA0 ] node #221 ( MUL): attn_norm-6 ( 15K) [CUDA0 ]: norm-6 ( 15K) [CUDA0 ] blk.6.attn_norm.weig ( 15K) [CUDA0 ] node #222 ( MUL_MAT): Qcur-6 ( 16K) [CUDA0 ]: blk.6.attn_q.weight ( 8M) [CUDA0 ] attn_norm-6 ( 15K) [CUDA0 ] node #224 ( RMS_NORM): norm-6 ( 16K) [CUDA0 ]: Qcur-6 (reshaped) ( 16K) [CUDA0 ] node #225 ( MUL): Qcur_normed-6 ( 16K) [CUDA0 ]: norm-6 ( 16K) [CUDA0 ] blk.6.attn_q_norm.we ( 1K) [CUDA0 ] node #226 ( ROPE): Qcur-6 ( 16K) [CUDA0 ]: Qcur_normed-6 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #227 ( MUL_MAT): Kcur-6 ( 8K) [CUDA0 ]: blk.6.attn_k.weight ( 4M) [CUDA0 ] attn_norm-6 ( 15K) [CUDA0 ] node #229 ( RMS_NORM): norm-6 ( 8K) [CUDA0 ]: Kcur-6 (reshaped) ( 8K) [CUDA0 ] node #230 ( MUL): Kcur_normed-6 ( 8K) [CUDA0 ]: norm-6 ( 8K) [CUDA0 ] blk.6.attn_k_norm.we ( 1K) [CUDA0 ] node #231 ( ROPE): Kcur-6 ( 8K) [CUDA0 ]: Kcur_normed-6 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #232 ( MUL_MAT): Vcur-6 ( 8K) [CUDA0 ]: blk.6.attn_v.weight ( 4M) [CUDA0 ] attn_norm-6 ( 15K) [CUDA0 ] node #234 ( CPY): k_cache_view-6 (copy ( 2K) [CUDA0 ]: Kcur-6 ( 8K) [CUDA0 ] k_cache_view-6 ( 2K) [CUDA0 ] node #236 ( CPY): v_cache_view-6 (copy ( 2K) [CUDA0 ]: Vcur-6 ( 8K) [CUDA0 ] v_cache_view-6 ( 2K) [CUDA0 ]
SPLIT #14: CPU # 3 inputs: [q-6 ( 16K)] [k-6 ( 544K)] [v-6 ( 544K)]
node #240 (FLASH_ATTN): node_240 ( 16K) [ CPU ]: CPU#q-6#0 ( 16K) [ NULL ] CPU#k-6#0 ( 544K) [ NULL ] CPU#v-6#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #15: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #242 ( MUL_MAT): kqv_out-6 ( 15K) [CUDA0 ]: blk.6.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #243 ( RMS_NORM): norm-6 ( 15K) [CUDA0 ]: kqv_out-6 ( 15K) [CUDA0 ] node #244 ( MUL): attn_post_norm-6 ( 15K) [CUDA0 ]: norm-6 ( 15K) [CUDA0 ] blk.6.post_attention ( 15K) [CUDA0 ] node #245 ( ADD): sa_out-6 ( 15K) [CUDA0 ]: attn_post_norm-6 ( 15K) [CUDA0 ] l_out-5 ( 15K) [CUDA0 ] node #246 ( RMS_NORM): norm-6 ( 15K) [CUDA0 ]: sa_out-6 ( 15K) [CUDA0 ] node #247 ( MUL): ffn_norm-6 ( 15K) [CUDA0 ]: norm-6 ( 15K) [CUDA0 ] blk.6.ffn_norm.weigh ( 15K) [CUDA0 ] node #248 ( MUL_MAT): ffn_gate-6 ( 60K) [CUDA0 ]: blk.6.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-6 ( 15K) [CUDA0 ] node #249 ( UNARY): ffn_gelu-6 ( 60K) [CUDA0 ]: ffn_gate-6 ( 60K) [CUDA0 ] node #250 ( MUL_MAT): ffn_up-6 ( 60K) [CUDA0 ]: blk.6.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-6 ( 15K) [CUDA0 ] node #251 ( MUL): ffn_gate_par-6 ( 60K) [CUDA0 ]: ffn_gelu-6 ( 60K) [CUDA0 ] ffn_up-6 ( 60K) [CUDA0 ] node #252 ( MUL_MAT): ffn_out-6 ( 15K) [CUDA0 ]: blk.6.ffn_down.weigh ( 31M) [CUDA0 ] ffn_gate_par-6 ( 60K) [CUDA0 ] node #253 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-6 ( 15K) [CUDA0 ] node #254 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.6.post_ffw_norm. ( 15K) [CUDA0 ] node #255 ( ADD): l_out-6 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-6 ( 15K) [CUDA0 ] node #256 ( RMS_NORM): norm-7 ( 15K) [CUDA0 ]: l_out-6 ( 15K) [CUDA0 ] node #257 ( MUL): attn_norm-7 ( 15K) [CUDA0 ]: norm-7 ( 15K) [CUDA0 ] blk.7.attn_norm.weig ( 15K) [CUDA0 ] node #258 ( MUL_MAT): Qcur-7 ( 16K) [CUDA0 ]: blk.7.attn_q.weight ( 8M) [CUDA0 ] attn_norm-7 ( 15K) [CUDA0 ] node #260 ( RMS_NORM): norm-7 ( 16K) [CUDA0 ]: Qcur-7 (reshaped) ( 16K) [CUDA0 ] node #261 ( MUL): Qcur_normed-7 ( 16K) [CUDA0 ]: norm-7 ( 16K) [CUDA0 ] blk.7.attn_q_norm.we ( 1K) [CUDA0 ] node #262 ( ROPE): Qcur-7 ( 16K) [CUDA0 ]: Qcur_normed-7 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #263 ( MUL_MAT): Kcur-7 ( 8K) [CUDA0 ]: blk.7.attn_k.weight ( 4M) [CUDA0 ] attn_norm-7 ( 15K) [CUDA0 ] node #265 ( RMS_NORM): norm-7 ( 8K) [CUDA0 ]: Kcur-7 (reshaped) ( 8K) [CUDA0 ] node #266 ( MUL): Kcur_normed-7 ( 8K) [CUDA0 ]: norm-7 ( 8K) [CUDA0 ] blk.7.attn_k_norm.we ( 1K) [CUDA0 ] node #267 ( ROPE): Kcur-7 ( 8K) [CUDA0 ]: Kcur_normed-7 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #268 ( MUL_MAT): Vcur-7 ( 8K) [CUDA0 ]: blk.7.attn_v.weight ( 4M) [CUDA0 ] attn_norm-7 ( 15K) [CUDA0 ] node #270 ( CPY): k_cache_view-7 (copy ( 2K) [CUDA0 ]: Kcur-7 ( 8K) [CUDA0 ] k_cache_view-7 ( 2K) [CUDA0 ] node #272 ( CPY): v_cache_view-7 (copy ( 2K) [CUDA0 ]: Vcur-7 ( 8K) [CUDA0 ] v_cache_view-7 ( 2K) [CUDA0 ]
SPLIT #16: CPU # 3 inputs: [q-7 ( 16K)] [k-7 ( 544K)] [v-7 ( 544K)]
node #276 (FLASH_ATTN): node_276 ( 16K) [ CPU ]: CPU#q-7#0 ( 16K) [ NULL ] CPU#k-7#0 ( 544K) [ NULL ] CPU#v-7#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #17: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #278 ( MUL_MAT): kqv_out-7 ( 15K) [CUDA0 ]: blk.7.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #279 ( RMS_NORM): norm-7 ( 15K) [CUDA0 ]: kqv_out-7 ( 15K) [CUDA0 ] node #280 ( MUL): attn_post_norm-7 ( 15K) [CUDA0 ]: norm-7 ( 15K) [CUDA0 ] blk.7.post_attention ( 15K) [CUDA0 ] node #281 ( ADD): sa_out-7 ( 15K) [CUDA0 ]: attn_post_norm-7 ( 15K) [CUDA0 ] l_out-6 ( 15K) [CUDA0 ] node #282 ( RMS_NORM): norm-7 ( 15K) [CUDA0 ]: sa_out-7 ( 15K) [CUDA0 ] node #283 ( MUL): ffn_norm-7 ( 15K) [CUDA0 ]: norm-7 ( 15K) [CUDA0 ] blk.7.ffn_norm.weigh ( 15K) [CUDA0 ] node #284 ( MUL_MAT): ffn_gate-7 ( 60K) [CUDA0 ]: blk.7.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-7 ( 15K) [CUDA0 ] node #285 ( UNARY): ffn_gelu-7 ( 60K) [CUDA0 ]: ffn_gate-7 ( 60K) [CUDA0 ] node #286 ( MUL_MAT): ffn_up-7 ( 60K) [CUDA0 ]: blk.7.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-7 ( 15K) [CUDA0 ] node #287 ( MUL): ffn_gate_par-7 ( 60K) [CUDA0 ]: ffn_gelu-7 ( 60K) [CUDA0 ] ffn_up-7 ( 60K) [CUDA0 ] node #288 ( MUL_MAT): ffn_out-7 ( 15K) [CUDA0 ]: blk.7.ffn_down.weigh ( 31M) [CUDA0 ] ffn_gate_par-7 ( 60K) [CUDA0 ] node #289 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-7 ( 15K) [CUDA0 ] node #290 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.7.post_ffw_norm. ( 15K) [CUDA0 ] node #291 ( ADD): l_out-7 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-7 ( 15K) [CUDA0 ] node #292 ( RMS_NORM): norm-8 ( 15K) [CUDA0 ]: l_out-7 ( 15K) [CUDA0 ] node #293 ( MUL): attn_norm-8 ( 15K) [CUDA0 ]: norm-8 ( 15K) [CUDA0 ] blk.8.attn_norm.weig ( 15K) [CUDA0 ] node #294 ( MUL_MAT): Qcur-8 ( 16K) [CUDA0 ]: blk.8.attn_q.weight ( 8M) [CUDA0 ] attn_norm-8 ( 15K) [CUDA0 ] node #296 ( RMS_NORM): norm-8 ( 16K) [CUDA0 ]: Qcur-8 (reshaped) ( 16K) [CUDA0 ] node #297 ( MUL): Qcur_normed-8 ( 16K) [CUDA0 ]: norm-8 ( 16K) [CUDA0 ] blk.8.attn_q_norm.we ( 1K) [CUDA0 ] node #298 ( ROPE): Qcur-8 ( 16K) [CUDA0 ]: Qcur_normed-8 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #299 ( MUL_MAT): Kcur-8 ( 8K) [CUDA0 ]: blk.8.attn_k.weight ( 4M) [CUDA0 ] attn_norm-8 ( 15K) [CUDA0 ] node #301 ( RMS_NORM): norm-8 ( 8K) [CUDA0 ]: Kcur-8 (reshaped) ( 8K) [CUDA0 ] node #302 ( MUL): Kcur_normed-8 ( 8K) [CUDA0 ]: norm-8 ( 8K) [CUDA0 ] blk.8.attn_k_norm.we ( 1K) [CUDA0 ] node #303 ( ROPE): Kcur-8 ( 8K) [CUDA0 ]: Kcur_normed-8 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #304 ( MUL_MAT): Vcur-8 ( 8K) [CUDA0 ]: blk.8.attn_v.weight ( 6M) [CUDA0 ] attn_norm-8 ( 15K) [CUDA0 ] node #306 ( CPY): k_cache_view-8 (copy ( 2K) [CUDA0 ]: Kcur-8 ( 8K) [CUDA0 ] k_cache_view-8 ( 2K) [CUDA0 ] node #308 ( CPY): v_cache_view-8 (copy ( 2K) [CUDA0 ]: Vcur-8 ( 8K) [CUDA0 ] v_cache_view-8 ( 2K) [CUDA0 ]
SPLIT #18: CPU # 3 inputs: [q-8 ( 16K)] [k-8 ( 544K)] [v-8 ( 544K)]
node #312 (FLASH_ATTN): node_312 ( 16K) [ CPU ]: CPU#q-8#0 ( 16K) [ NULL ] CPU#k-8#0 ( 544K) [ NULL ] CPU#v-8#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #19: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #314 ( MUL_MAT): kqv_out-8 ( 15K) [CUDA0 ]: blk.8.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #315 ( RMS_NORM): norm-8 ( 15K) [CUDA0 ]: kqv_out-8 ( 15K) [CUDA0 ] node #316 ( MUL): attn_post_norm-8 ( 15K) [CUDA0 ]: norm-8 ( 15K) [CUDA0 ] blk.8.post_attention ( 15K) [CUDA0 ] node #317 ( ADD): sa_out-8 ( 15K) [CUDA0 ]: attn_post_norm-8 ( 15K) [CUDA0 ] l_out-7 ( 15K) [CUDA0 ] node #318 ( RMS_NORM): norm-8 ( 15K) [CUDA0 ]: sa_out-8 ( 15K) [CUDA0 ] node #319 ( MUL): ffn_norm-8 ( 15K) [CUDA0 ]: norm-8 ( 15K) [CUDA0 ] blk.8.ffn_norm.weigh ( 15K) [CUDA0 ] node #320 ( MUL_MAT): ffn_gate-8 ( 60K) [CUDA0 ]: blk.8.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-8 ( 15K) [CUDA0 ] node #321 ( UNARY): ffn_gelu-8 ( 60K) [CUDA0 ]: ffn_gate-8 ( 60K) [CUDA0 ] node #322 ( MUL_MAT): ffn_up-8 ( 60K) [CUDA0 ]: blk.8.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-8 ( 15K) [CUDA0 ] node #323 ( MUL): ffn_gate_par-8 ( 60K) [CUDA0 ]: ffn_gelu-8 ( 60K) [CUDA0 ] ffn_up-8 ( 60K) [CUDA0 ] node #324 ( MUL_MAT): ffn_out-8 ( 15K) [CUDA0 ]: blk.8.ffn_down.weigh ( 46M) [CUDA0 ] ffn_gate_par-8 ( 60K) [CUDA0 ] node #325 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-8 ( 15K) [CUDA0 ] node #326 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.8.post_ffw_norm. ( 15K) [CUDA0 ] node #327 ( ADD): l_out-8 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-8 ( 15K) [CUDA0 ] node #328 ( RMS_NORM): norm-9 ( 15K) [CUDA0 ]: l_out-8 ( 15K) [CUDA0 ] node #329 ( MUL): attn_norm-9 ( 15K) [CUDA0 ]: norm-9 ( 15K) [CUDA0 ] blk.9.attn_norm.weig ( 15K) [CUDA0 ] node #330 ( MUL_MAT): Qcur-9 ( 16K) [CUDA0 ]: blk.9.attn_q.weight ( 8M) [CUDA0 ] attn_norm-9 ( 15K) [CUDA0 ] node #332 ( RMS_NORM): norm-9 ( 16K) [CUDA0 ]: Qcur-9 (reshaped) ( 16K) [CUDA0 ] node #333 ( MUL): Qcur_normed-9 ( 16K) [CUDA0 ]: norm-9 ( 16K) [CUDA0 ] blk.9.attn_q_norm.we ( 1K) [CUDA0 ] node #334 ( ROPE): Qcur-9 ( 16K) [CUDA0 ]: Qcur_normed-9 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #335 ( MUL_MAT): Kcur-9 ( 8K) [CUDA0 ]: blk.9.attn_k.weight ( 4M) [CUDA0 ] attn_norm-9 ( 15K) [CUDA0 ] node #337 ( RMS_NORM): norm-9 ( 8K) [CUDA0 ]: Kcur-9 (reshaped) ( 8K) [CUDA0 ] node #338 ( MUL): Kcur_normed-9 ( 8K) [CUDA0 ]: norm-9 ( 8K) [CUDA0 ] blk.9.attn_k_norm.we ( 1K) [CUDA0 ] node #339 ( ROPE): Kcur-9 ( 8K) [CUDA0 ]: Kcur_normed-9 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #340 ( MUL_MAT): Vcur-9 ( 8K) [CUDA0 ]: blk.9.attn_v.weight ( 4M) [CUDA0 ] attn_norm-9 ( 15K) [CUDA0 ] node #342 ( CPY): k_cache_view-9 (copy ( 2K) [CUDA0 ]: Kcur-9 ( 8K) [CUDA0 ] k_cache_view-9 ( 2K) [CUDA0 ] node #344 ( CPY): v_cache_view-9 (copy ( 2K) [CUDA0 ]: Vcur-9 ( 8K) [CUDA0 ] v_cache_view-9 ( 2K) [CUDA0 ]
SPLIT #20: CPU # 3 inputs: [q-9 ( 16K)] [k-9 ( 544K)] [v-9 ( 544K)]
node #348 (FLASH_ATTN): node_348 ( 16K) [ CPU ]: CPU#q-9#0 ( 16K) [ NULL ] CPU#k-9#0 ( 544K) [ NULL ] CPU#v-9#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #21: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #350 ( MUL_MAT): kqv_out-9 ( 15K) [CUDA0 ]: blk.9.attn_output.we ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #351 ( RMS_NORM): norm-9 ( 15K) [CUDA0 ]: kqv_out-9 ( 15K) [CUDA0 ] node #352 ( MUL): attn_post_norm-9 ( 15K) [CUDA0 ]: norm-9 ( 15K) [CUDA0 ] blk.9.post_attention ( 15K) [CUDA0 ] node #353 ( ADD): sa_out-9 ( 15K) [CUDA0 ]: attn_post_norm-9 ( 15K) [CUDA0 ] l_out-8 ( 15K) [CUDA0 ] node #354 ( RMS_NORM): norm-9 ( 15K) [CUDA0 ]: sa_out-9 ( 15K) [CUDA0 ] node #355 ( MUL): ffn_norm-9 ( 15K) [CUDA0 ]: norm-9 ( 15K) [CUDA0 ] blk.9.ffn_norm.weigh ( 15K) [CUDA0 ] node #356 ( MUL_MAT): ffn_gate-9 ( 60K) [CUDA0 ]: blk.9.ffn_gate.weigh ( 31M) [CUDA0 ] ffn_norm-9 ( 15K) [CUDA0 ] node #357 ( UNARY): ffn_gelu-9 ( 60K) [CUDA0 ]: ffn_gate-9 ( 60K) [CUDA0 ] node #358 ( MUL_MAT): ffn_up-9 ( 60K) [CUDA0 ]: blk.9.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-9 ( 15K) [CUDA0 ] node #359 ( MUL): ffn_gate_par-9 ( 60K) [CUDA0 ]: ffn_gelu-9 ( 60K) [CUDA0 ] ffn_up-9 ( 60K) [CUDA0 ] node #360 ( MUL_MAT): ffn_out-9 ( 15K) [CUDA0 ]: blk.9.ffn_down.weigh ( 31M) [CUDA0 ] ffn_gate_par-9 ( 60K) [CUDA0 ] node #361 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-9 ( 15K) [CUDA0 ] node #362 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.9.post_ffw_norm. ( 15K) [CUDA0 ] node #363 ( ADD): l_out-9 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-9 ( 15K) [CUDA0 ] node #364 ( RMS_NORM): norm-10 ( 15K) [CUDA0 ]: l_out-9 ( 15K) [CUDA0 ] node #365 ( MUL): attn_norm-10 ( 15K) [CUDA0 ]: norm-10 ( 15K) [CUDA0 ] blk.10.attn_norm.wei ( 15K) [CUDA0 ] node #366 ( MUL_MAT): Qcur-10 ( 16K) [CUDA0 ]: blk.10.attn_q.weight ( 8M) [CUDA0 ] attn_norm-10 ( 15K) [CUDA0 ] node #368 ( RMS_NORM): norm-10 ( 16K) [CUDA0 ]: Qcur-10 (reshaped) ( 16K) [CUDA0 ] node #369 ( MUL): Qcur_normed-10 ( 16K) [CUDA0 ]: norm-10 ( 16K) [CUDA0 ] blk.10.attn_q_norm.w ( 1K) [CUDA0 ] node #370 ( ROPE): Qcur-10 ( 16K) [CUDA0 ]: Qcur_normed-10 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #371 ( MUL_MAT): Kcur-10 ( 8K) [CUDA0 ]: blk.10.attn_k.weight ( 4M) [CUDA0 ] attn_norm-10 ( 15K) [CUDA0 ] node #373 ( RMS_NORM): norm-10 ( 8K) [CUDA0 ]: Kcur-10 (reshaped) ( 8K) [CUDA0 ] node #374 ( MUL): Kcur_normed-10 ( 8K) [CUDA0 ]: norm-10 ( 8K) [CUDA0 ] blk.10.attn_k_norm.w ( 1K) [CUDA0 ] node #375 ( ROPE): Kcur-10 ( 8K) [CUDA0 ]: Kcur_normed-10 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #376 ( MUL_MAT): Vcur-10 ( 8K) [CUDA0 ]: blk.10.attn_v.weight ( 4M) [CUDA0 ] attn_norm-10 ( 15K) [CUDA0 ] node #378 ( CPY): k_cache_view-10 (cop ( 2K) [CUDA0 ]: Kcur-10 ( 8K) [CUDA0 ] k_cache_view-10 ( 2K) [CUDA0 ] node #380 ( CPY): v_cache_view-10 (cop ( 2K) [CUDA0 ]: Vcur-10 ( 8K) [CUDA0 ] v_cache_view-10 ( 2K) [CUDA0 ]
SPLIT #22: CPU # 3 inputs: [q-10 ( 16K)] [k-10 ( 544K)] [v-10 ( 544K)]
node #384 (FLASH_ATTN): node_384 ( 16K) [ CPU ]: CPU#q-10#0 ( 16K) [ NULL ] CPU#k-10#0 ( 544K) [ NULL ] CPU#v-10#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #23: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #386 ( MUL_MAT): kqv_out-10 ( 15K) [CUDA0 ]: blk.10.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #387 ( RMS_NORM): norm-10 ( 15K) [CUDA0 ]: kqv_out-10 ( 15K) [CUDA0 ] node #388 ( MUL): attn_post_norm-10 ( 15K) [CUDA0 ]: norm-10 ( 15K) [CUDA0 ] blk.10.post_attentio ( 15K) [CUDA0 ] node #389 ( ADD): sa_out-10 ( 15K) [CUDA0 ]: attn_post_norm-10 ( 15K) [CUDA0 ] l_out-9 ( 15K) [CUDA0 ] node #390 ( RMS_NORM): norm-10 ( 15K) [CUDA0 ]: sa_out-10 ( 15K) [CUDA0 ] node #391 ( MUL): ffn_norm-10 ( 15K) [CUDA0 ]: norm-10 ( 15K) [CUDA0 ] blk.10.ffn_norm.weig ( 15K) [CUDA0 ] node #392 ( MUL_MAT): ffn_gate-10 ( 60K) [CUDA0 ]: blk.10.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-10 ( 15K) [CUDA0 ] node #393 ( UNARY): ffn_gelu-10 ( 60K) [CUDA0 ]: ffn_gate-10 ( 60K) [CUDA0 ] node #394 ( MUL_MAT): ffn_up-10 ( 60K) [CUDA0 ]: blk.10.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-10 ( 15K) [CUDA0 ] node #395 ( MUL): ffn_gate_par-10 ( 60K) [CUDA0 ]: ffn_gelu-10 ( 60K) [CUDA0 ] ffn_up-10 ( 60K) [CUDA0 ] node #396 ( MUL_MAT): ffn_out-10 ( 15K) [CUDA0 ]: blk.10.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-10 ( 60K) [CUDA0 ] node #397 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-10 ( 15K) [CUDA0 ] node #398 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.10.post_ffw_norm ( 15K) [CUDA0 ] node #399 ( ADD): l_out-10 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-10 ( 15K) [CUDA0 ] node #400 ( RMS_NORM): norm-11 ( 15K) [CUDA0 ]: l_out-10 ( 15K) [CUDA0 ] node #401 ( MUL): attn_norm-11 ( 15K) [CUDA0 ]: norm-11 ( 15K) [CUDA0 ] blk.11.attn_norm.wei ( 15K) [CUDA0 ] node #402 ( MUL_MAT): Qcur-11 ( 16K) [CUDA0 ]: blk.11.attn_q.weight ( 8M) [CUDA0 ] attn_norm-11 ( 15K) [CUDA0 ] node #404 ( RMS_NORM): norm-11 ( 16K) [CUDA0 ]: Qcur-11 (reshaped) ( 16K) [CUDA0 ] node #405 ( MUL): Qcur_normed-11 ( 16K) [CUDA0 ]: norm-11 ( 16K) [CUDA0 ] blk.11.attn_q_norm.w ( 1K) [CUDA0 ] node #406 ( ROPE): Qcur-11 ( 16K) [CUDA0 ]: Qcur_normed-11 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #407 ( MUL_MAT): Kcur-11 ( 8K) [CUDA0 ]: blk.11.attn_k.weight ( 4M) [CUDA0 ] attn_norm-11 ( 15K) [CUDA0 ] node #409 ( RMS_NORM): norm-11 ( 8K) [CUDA0 ]: Kcur-11 (reshaped) ( 8K) [CUDA0 ] node #410 ( MUL): Kcur_normed-11 ( 8K) [CUDA0 ]: norm-11 ( 8K) [CUDA0 ] blk.11.attn_k_norm.w ( 1K) [CUDA0 ] node #411 ( ROPE): Kcur-11 ( 8K) [CUDA0 ]: Kcur_normed-11 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #412 ( MUL_MAT): Vcur-11 ( 8K) [CUDA0 ]: blk.11.attn_v.weight ( 6M) [CUDA0 ] attn_norm-11 ( 15K) [CUDA0 ] node #414 ( CPY): k_cache_view-11 (cop ( 2K) [CUDA0 ]: Kcur-11 ( 8K) [CUDA0 ] k_cache_view-11 ( 2K) [CUDA0 ] node #416 ( CPY): v_cache_view-11 (cop ( 2K) [CUDA0 ]: Vcur-11 ( 8K) [CUDA0 ] v_cache_view-11 ( 2K) [CUDA0 ]
SPLIT #24: CPU # 3 inputs: [q-11 ( 16K)] [k-11 ( 544K)] [v-11 ( 544K)]
node #420 (FLASH_ATTN): node_420 ( 16K) [ CPU ]: CPU#q-11#0 ( 16K) [ NULL ] CPU#k-11#0 ( 544K) [ NULL ] CPU#v-11#0 ( 544K) [ NULL ] CPU#KQ_mask (copy)#0 ( 32K) [ NULL ]
SPLIT #25: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #422 ( MUL_MAT): kqv_out-11 ( 15K) [CUDA0 ]: blk.11.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #423 ( RMS_NORM): norm-11 ( 15K) [CUDA0 ]: kqv_out-11 ( 15K) [CUDA0 ] node #424 ( MUL): attn_post_norm-11 ( 15K) [CUDA0 ]: norm-11 ( 15K) [CUDA0 ] blk.11.post_attentio ( 15K) [CUDA0 ] node #425 ( ADD): sa_out-11 ( 15K) [CUDA0 ]: attn_post_norm-11 ( 15K) [CUDA0 ] l_out-10 ( 15K) [CUDA0 ] node #426 ( RMS_NORM): norm-11 ( 15K) [CUDA0 ]: sa_out-11 ( 15K) [CUDA0 ] node #427 ( MUL): ffn_norm-11 ( 15K) [CUDA0 ]: norm-11 ( 15K) [CUDA0 ] blk.11.ffn_norm.weig ( 15K) [CUDA0 ] node #428 ( MUL_MAT): ffn_gate-11 ( 60K) [CUDA0 ]: blk.11.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-11 ( 15K) [CUDA0 ] node #429 ( UNARY): ffn_gelu-11 ( 60K) [CUDA0 ]: ffn_gate-11 ( 60K) [CUDA0 ] node #430 ( MUL_MAT): ffn_up-11 ( 60K) [CUDA0 ]: blk.11.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-11 ( 15K) [CUDA0 ] node #431 ( MUL): ffn_gate_par-11 ( 60K) [CUDA0 ]: ffn_gelu-11 ( 60K) [CUDA0 ] ffn_up-11 ( 60K) [CUDA0 ] node #432 ( MUL_MAT): ffn_out-11 ( 15K) [CUDA0 ]: blk.11.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-11 ( 60K) [CUDA0 ] node #433 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-11 ( 15K) [CUDA0 ] node #434 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.11.post_ffw_norm ( 15K) [CUDA0 ] node #435 ( ADD): l_out-11 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-11 ( 15K) [CUDA0 ] node #436 ( RMS_NORM): norm-12 ( 15K) [CUDA0 ]: l_out-11 ( 15K) [CUDA0 ] node #437 ( MUL): attn_norm-12 ( 15K) [CUDA0 ]: norm-12 ( 15K) [CUDA0 ] blk.12.attn_norm.wei ( 15K) [CUDA0 ] node #438 ( MUL_MAT): Qcur-12 ( 16K) [CUDA0 ]: blk.12.attn_q.weight ( 8M) [CUDA0 ] attn_norm-12 ( 15K) [CUDA0 ] node #440 ( RMS_NORM): norm-12 ( 16K) [CUDA0 ]: Qcur-12 (reshaped) ( 16K) [CUDA0 ] node #441 ( MUL): Qcur_normed-12 ( 16K) [CUDA0 ]: norm-12 ( 16K) [CUDA0 ] blk.12.attn_q_norm.w ( 1K) [CUDA0 ] node #442 ( ROPE): Qcur-12 ( 16K) [CUDA0 ]: Qcur_normed-12 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #443 ( MUL_MAT): Kcur-12 ( 8K) [CUDA0 ]: blk.12.attn_k.weight ( 4M) [CUDA0 ] attn_norm-12 ( 15K) [CUDA0 ] node #445 ( RMS_NORM): norm-12 ( 8K) [CUDA0 ]: Kcur-12 (reshaped) ( 8K) [CUDA0 ] node #446 ( MUL): Kcur_normed-12 ( 8K) [CUDA0 ]: norm-12 ( 8K) [CUDA0 ] blk.12.attn_k_norm.w ( 1K) [CUDA0 ] node #447 ( ROPE): Kcur-12 ( 8K) [CUDA0 ]: Kcur_normed-12 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #448 ( MUL_MAT): Vcur-12 ( 8K) [CUDA0 ]: blk.12.attn_v.weight ( 4M) [CUDA0 ] attn_norm-12 ( 15K) [CUDA0 ] node #450 ( CPY): k_cache_view-12 (cop ( 2K) [CUDA0 ]: Kcur-12 ( 8K) [CUDA0 ] k_cache_view-12 ( 2K) [CUDA0 ] node #452 ( CPY): v_cache_view-12 (cop ( 2K) [CUDA0 ]: Vcur-12 ( 8K) [CUDA0 ] v_cache_view-12 ( 2K) [CUDA0 ]
SPLIT #26: CPU # 3 inputs: [q-12 ( 16K)] [k-12 ( 544K)] [v-12 ( 544K)]
node #456 (FLASH_ATTN): node_456 ( 16K) [ CPU ]: CPU#q-12#0 ( 16K) [ NULL ] CPU#k-12#0 ( 544K) [ NULL ] CPU#v-12#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #27: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #458 ( MUL_MAT): kqv_out-12 ( 15K) [CUDA0 ]: blk.12.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #459 ( RMS_NORM): norm-12 ( 15K) [CUDA0 ]: kqv_out-12 ( 15K) [CUDA0 ] node #460 ( MUL): attn_post_norm-12 ( 15K) [CUDA0 ]: norm-12 ( 15K) [CUDA0 ] blk.12.post_attentio ( 15K) [CUDA0 ] node #461 ( ADD): sa_out-12 ( 15K) [CUDA0 ]: attn_post_norm-12 ( 15K) [CUDA0 ] l_out-11 ( 15K) [CUDA0 ] node #462 ( RMS_NORM): norm-12 ( 15K) [CUDA0 ]: sa_out-12 ( 15K) [CUDA0 ] node #463 ( MUL): ffn_norm-12 ( 15K) [CUDA0 ]: norm-12 ( 15K) [CUDA0 ] blk.12.ffn_norm.weig ( 15K) [CUDA0 ] node #464 ( MUL_MAT): ffn_gate-12 ( 60K) [CUDA0 ]: blk.12.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-12 ( 15K) [CUDA0 ] node #465 ( UNARY): ffn_gelu-12 ( 60K) [CUDA0 ]: ffn_gate-12 ( 60K) [CUDA0 ] node #466 ( MUL_MAT): ffn_up-12 ( 60K) [CUDA0 ]: blk.12.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-12 ( 15K) [CUDA0 ] node #467 ( MUL): ffn_gate_par-12 ( 60K) [CUDA0 ]: ffn_gelu-12 ( 60K) [CUDA0 ] ffn_up-12 ( 60K) [CUDA0 ] node #468 ( MUL_MAT): ffn_out-12 ( 15K) [CUDA0 ]: blk.12.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-12 ( 60K) [CUDA0 ] node #469 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-12 ( 15K) [CUDA0 ] node #470 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.12.post_ffw_norm ( 15K) [CUDA0 ] node #471 ( ADD): l_out-12 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-12 ( 15K) [CUDA0 ] node #472 ( RMS_NORM): norm-13 ( 15K) [CUDA0 ]: l_out-12 ( 15K) [CUDA0 ] node #473 ( MUL): attn_norm-13 ( 15K) [CUDA0 ]: norm-13 ( 15K) [CUDA0 ] blk.13.attn_norm.wei ( 15K) [CUDA0 ] node #474 ( MUL_MAT): Qcur-13 ( 16K) [CUDA0 ]: blk.13.attn_q.weight ( 8M) [CUDA0 ] attn_norm-13 ( 15K) [CUDA0 ] node #476 ( RMS_NORM): norm-13 ( 16K) [CUDA0 ]: Qcur-13 (reshaped) ( 16K) [CUDA0 ] node #477 ( MUL): Qcur_normed-13 ( 16K) [CUDA0 ]: norm-13 ( 16K) [CUDA0 ] blk.13.attn_q_norm.w ( 1K) [CUDA0 ] node #478 ( ROPE): Qcur-13 ( 16K) [CUDA0 ]: Qcur_normed-13 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #479 ( MUL_MAT): Kcur-13 ( 8K) [CUDA0 ]: blk.13.attn_k.weight ( 4M) [CUDA0 ] attn_norm-13 ( 15K) [CUDA0 ] node #481 ( RMS_NORM): norm-13 ( 8K) [CUDA0 ]: Kcur-13 (reshaped) ( 8K) [CUDA0 ] node #482 ( MUL): Kcur_normed-13 ( 8K) [CUDA0 ]: norm-13 ( 8K) [CUDA0 ] blk.13.attn_k_norm.w ( 1K) [CUDA0 ] node #483 ( ROPE): Kcur-13 ( 8K) [CUDA0 ]: Kcur_normed-13 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #484 ( MUL_MAT): Vcur-13 ( 8K) [CUDA0 ]: blk.13.attn_v.weight ( 4M) [CUDA0 ] attn_norm-13 ( 15K) [CUDA0 ] node #486 ( CPY): k_cache_view-13 (cop ( 2K) [CUDA0 ]: Kcur-13 ( 8K) [CUDA0 ] k_cache_view-13 ( 2K) [CUDA0 ] node #488 ( CPY): v_cache_view-13 (cop ( 2K) [CUDA0 ]: Vcur-13 ( 8K) [CUDA0 ] v_cache_view-13 ( 2K) [CUDA0 ]
SPLIT #28: CPU # 3 inputs: [q-13 ( 16K)] [k-13 ( 544K)] [v-13 ( 544K)]
node #492 (FLASH_ATTN): node_492 ( 16K) [ CPU ]: CPU#q-13#0 ( 16K) [ NULL ] CPU#k-13#0 ( 544K) [ NULL ] CPU#v-13#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #29: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #494 ( MUL_MAT): kqv_out-13 ( 15K) [CUDA0 ]: blk.13.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #495 ( RMS_NORM): norm-13 ( 15K) [CUDA0 ]: kqv_out-13 ( 15K) [CUDA0 ] node #496 ( MUL): attn_post_norm-13 ( 15K) [CUDA0 ]: norm-13 ( 15K) [CUDA0 ] blk.13.post_attentio ( 15K) [CUDA0 ] node #497 ( ADD): sa_out-13 ( 15K) [CUDA0 ]: attn_post_norm-13 ( 15K) [CUDA0 ] l_out-12 ( 15K) [CUDA0 ] node #498 ( RMS_NORM): norm-13 ( 15K) [CUDA0 ]: sa_out-13 ( 15K) [CUDA0 ] node #499 ( MUL): ffn_norm-13 ( 15K) [CUDA0 ]: norm-13 ( 15K) [CUDA0 ] blk.13.ffn_norm.weig ( 15K) [CUDA0 ] node #500 ( MUL_MAT): ffn_gate-13 ( 60K) [CUDA0 ]: blk.13.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-13 ( 15K) [CUDA0 ] node #501 ( UNARY): ffn_gelu-13 ( 60K) [CUDA0 ]: ffn_gate-13 ( 60K) [CUDA0 ] node #502 ( MUL_MAT): ffn_up-13 ( 60K) [CUDA0 ]: blk.13.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-13 ( 15K) [CUDA0 ] node #503 ( MUL): ffn_gate_par-13 ( 60K) [CUDA0 ]: ffn_gelu-13 ( 60K) [CUDA0 ] ffn_up-13 ( 60K) [CUDA0 ] node #504 ( MUL_MAT): ffn_out-13 ( 15K) [CUDA0 ]: blk.13.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-13 ( 60K) [CUDA0 ] node #505 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-13 ( 15K) [CUDA0 ] node #506 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.13.post_ffw_norm ( 15K) [CUDA0 ] node #507 ( ADD): l_out-13 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-13 ( 15K) [CUDA0 ] node #508 ( RMS_NORM): norm-14 ( 15K) [CUDA0 ]: l_out-13 ( 15K) [CUDA0 ] node #509 ( MUL): attn_norm-14 ( 15K) [CUDA0 ]: norm-14 ( 15K) [CUDA0 ] blk.14.attn_norm.wei ( 15K) [CUDA0 ] node #510 ( MUL_MAT): Qcur-14 ( 16K) [CUDA0 ]: blk.14.attn_q.weight ( 8M) [CUDA0 ] attn_norm-14 ( 15K) [CUDA0 ] node #512 ( RMS_NORM): norm-14 ( 16K) [CUDA0 ]: Qcur-14 (reshaped) ( 16K) [CUDA0 ] node #513 ( MUL): Qcur_normed-14 ( 16K) [CUDA0 ]: norm-14 ( 16K) [CUDA0 ] blk.14.attn_q_norm.w ( 1K) [CUDA0 ] node #514 ( ROPE): Qcur-14 ( 16K) [CUDA0 ]: Qcur_normed-14 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #515 ( MUL_MAT): Kcur-14 ( 8K) [CUDA0 ]: blk.14.attn_k.weight ( 4M) [CUDA0 ] attn_norm-14 ( 15K) [CUDA0 ] node #517 ( RMS_NORM): norm-14 ( 8K) [CUDA0 ]: Kcur-14 (reshaped) ( 8K) [CUDA0 ] node #518 ( MUL): Kcur_normed-14 ( 8K) [CUDA0 ]: norm-14 ( 8K) [CUDA0 ] blk.14.attn_k_norm.w ( 1K) [CUDA0 ] node #519 ( ROPE): Kcur-14 ( 8K) [CUDA0 ]: Kcur_normed-14 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #520 ( MUL_MAT): Vcur-14 ( 8K) [CUDA0 ]: blk.14.attn_v.weight ( 6M) [CUDA0 ] attn_norm-14 ( 15K) [CUDA0 ] node #522 ( CPY): k_cache_view-14 (cop ( 2K) [CUDA0 ]: Kcur-14 ( 8K) [CUDA0 ] k_cache_view-14 ( 2K) [CUDA0 ] node #524 ( CPY): v_cache_view-14 (cop ( 2K) [CUDA0 ]: Vcur-14 ( 8K) [CUDA0 ] v_cache_view-14 ( 2K) [CUDA0 ]
SPLIT #30: CPU # 3 inputs: [q-14 ( 16K)] [k-14 ( 544K)] [v-14 ( 544K)]
node #528 (FLASH_ATTN): node_528 ( 16K) [ CPU ]: CPU#q-14#0 ( 16K) [ NULL ] CPU#k-14#0 ( 544K) [ NULL ] CPU#v-14#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #31: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #530 ( MUL_MAT): kqv_out-14 ( 15K) [CUDA0 ]: blk.14.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #531 ( RMS_NORM): norm-14 ( 15K) [CUDA0 ]: kqv_out-14 ( 15K) [CUDA0 ] node #532 ( MUL): attn_post_norm-14 ( 15K) [CUDA0 ]: norm-14 ( 15K) [CUDA0 ] blk.14.post_attentio ( 15K) [CUDA0 ] node #533 ( ADD): sa_out-14 ( 15K) [CUDA0 ]: attn_post_norm-14 ( 15K) [CUDA0 ] l_out-13 ( 15K) [CUDA0 ] node #534 ( RMS_NORM): norm-14 ( 15K) [CUDA0 ]: sa_out-14 ( 15K) [CUDA0 ] node #535 ( MUL): ffn_norm-14 ( 15K) [CUDA0 ]: norm-14 ( 15K) [CUDA0 ] blk.14.ffn_norm.weig ( 15K) [CUDA0 ] node #536 ( MUL_MAT): ffn_gate-14 ( 60K) [CUDA0 ]: blk.14.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-14 ( 15K) [CUDA0 ] node #537 ( UNARY): ffn_gelu-14 ( 60K) [CUDA0 ]: ffn_gate-14 ( 60K) [CUDA0 ] node #538 ( MUL_MAT): ffn_up-14 ( 60K) [CUDA0 ]: blk.14.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-14 ( 15K) [CUDA0 ] node #539 ( MUL): ffn_gate_par-14 ( 60K) [CUDA0 ]: ffn_gelu-14 ( 60K) [CUDA0 ] ffn_up-14 ( 60K) [CUDA0 ] node #540 ( MUL_MAT): ffn_out-14 ( 15K) [CUDA0 ]: blk.14.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-14 ( 60K) [CUDA0 ] node #541 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-14 ( 15K) [CUDA0 ] node #542 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.14.post_ffw_norm ( 15K) [CUDA0 ] node #543 ( ADD): l_out-14 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-14 ( 15K) [CUDA0 ] node #544 ( RMS_NORM): norm-15 ( 15K) [CUDA0 ]: l_out-14 ( 15K) [CUDA0 ] node #545 ( MUL): attn_norm-15 ( 15K) [CUDA0 ]: norm-15 ( 15K) [CUDA0 ] blk.15.attn_norm.wei ( 15K) [CUDA0 ] node #546 ( MUL_MAT): Qcur-15 ( 16K) [CUDA0 ]: blk.15.attn_q.weight ( 8M) [CUDA0 ] attn_norm-15 ( 15K) [CUDA0 ] node #548 ( RMS_NORM): norm-15 ( 16K) [CUDA0 ]: Qcur-15 (reshaped) ( 16K) [CUDA0 ] node #549 ( MUL): Qcur_normed-15 ( 16K) [CUDA0 ]: norm-15 ( 16K) [CUDA0 ] blk.15.attn_q_norm.w ( 1K) [CUDA0 ] node #550 ( ROPE): Qcur-15 ( 16K) [CUDA0 ]: Qcur_normed-15 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #551 ( MUL_MAT): Kcur-15 ( 8K) [CUDA0 ]: blk.15.attn_k.weight ( 4M) [CUDA0 ] attn_norm-15 ( 15K) [CUDA0 ] node #553 ( RMS_NORM): norm-15 ( 8K) [CUDA0 ]: Kcur-15 (reshaped) ( 8K) [CUDA0 ] node #554 ( MUL): Kcur_normed-15 ( 8K) [CUDA0 ]: norm-15 ( 8K) [CUDA0 ] blk.15.attn_k_norm.w ( 1K) [CUDA0 ] node #555 ( ROPE): Kcur-15 ( 8K) [CUDA0 ]: Kcur_normed-15 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #556 ( MUL_MAT): Vcur-15 ( 8K) [CUDA0 ]: blk.15.attn_v.weight ( 4M) [CUDA0 ] attn_norm-15 ( 15K) [CUDA0 ] node #558 ( CPY): k_cache_view-15 (cop ( 2K) [CUDA0 ]: Kcur-15 ( 8K) [CUDA0 ] k_cache_view-15 ( 2K) [CUDA0 ] node #560 ( CPY): v_cache_view-15 (cop ( 2K) [CUDA0 ]: Vcur-15 ( 8K) [CUDA0 ] v_cache_view-15 ( 2K) [CUDA0 ]
SPLIT #32: CPU # 3 inputs: [q-15 ( 16K)] [k-15 ( 544K)] [v-15 ( 544K)]
node #564 (FLASH_ATTN): node_564 ( 16K) [ CPU ]: CPU#q-15#0 ( 16K) [ NULL ] CPU#k-15#0 ( 544K) [ NULL ] CPU#v-15#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #33: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #566 ( MUL_MAT): kqv_out-15 ( 15K) [CUDA0 ]: blk.15.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #567 ( RMS_NORM): norm-15 ( 15K) [CUDA0 ]: kqv_out-15 ( 15K) [CUDA0 ] node #568 ( MUL): attn_post_norm-15 ( 15K) [CUDA0 ]: norm-15 ( 15K) [CUDA0 ] blk.15.post_attentio ( 15K) [CUDA0 ] node #569 ( ADD): sa_out-15 ( 15K) [CUDA0 ]: attn_post_norm-15 ( 15K) [CUDA0 ] l_out-14 ( 15K) [CUDA0 ] node #570 ( RMS_NORM): norm-15 ( 15K) [CUDA0 ]: sa_out-15 ( 15K) [CUDA0 ] node #571 ( MUL): ffn_norm-15 ( 15K) [CUDA0 ]: norm-15 ( 15K) [CUDA0 ] blk.15.ffn_norm.weig ( 15K) [CUDA0 ] node #572 ( MUL_MAT): ffn_gate-15 ( 60K) [CUDA0 ]: blk.15.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-15 ( 15K) [CUDA0 ] node #573 ( UNARY): ffn_gelu-15 ( 60K) [CUDA0 ]: ffn_gate-15 ( 60K) [CUDA0 ] node #574 ( MUL_MAT): ffn_up-15 ( 60K) [CUDA0 ]: blk.15.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-15 ( 15K) [CUDA0 ] node #575 ( MUL): ffn_gate_par-15 ( 60K) [CUDA0 ]: ffn_gelu-15 ( 60K) [CUDA0 ] ffn_up-15 ( 60K) [CUDA0 ] node #576 ( MUL_MAT): ffn_out-15 ( 15K) [CUDA0 ]: blk.15.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-15 ( 60K) [CUDA0 ] node #577 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-15 ( 15K) [CUDA0 ] node #578 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.15.post_ffw_norm ( 15K) [CUDA0 ] node #579 ( ADD): l_out-15 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-15 ( 15K) [CUDA0 ] node #580 ( RMS_NORM): norm-16 ( 15K) [CUDA0 ]: l_out-15 ( 15K) [CUDA0 ] node #581 ( MUL): attn_norm-16 ( 15K) [CUDA0 ]: norm-16 ( 15K) [CUDA0 ] blk.16.attn_norm.wei ( 15K) [CUDA0 ] node #582 ( MUL_MAT): Qcur-16 ( 16K) [CUDA0 ]: blk.16.attn_q.weight ( 8M) [CUDA0 ] attn_norm-16 ( 15K) [CUDA0 ] node #584 ( RMS_NORM): norm-16 ( 16K) [CUDA0 ]: Qcur-16 (reshaped) ( 16K) [CUDA0 ] node #585 ( MUL): Qcur_normed-16 ( 16K) [CUDA0 ]: norm-16 ( 16K) [CUDA0 ] blk.16.attn_q_norm.w ( 1K) [CUDA0 ] node #586 ( ROPE): Qcur-16 ( 16K) [CUDA0 ]: Qcur_normed-16 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #587 ( MUL_MAT): Kcur-16 ( 8K) [CUDA0 ]: blk.16.attn_k.weight ( 4M) [CUDA0 ] attn_norm-16 ( 15K) [CUDA0 ] node #589 ( RMS_NORM): norm-16 ( 8K) [CUDA0 ]: Kcur-16 (reshaped) ( 8K) [CUDA0 ] node #590 ( MUL): Kcur_normed-16 ( 8K) [CUDA0 ]: norm-16 ( 8K) [CUDA0 ] blk.16.attn_k_norm.w ( 1K) [CUDA0 ] node #591 ( ROPE): Kcur-16 ( 8K) [CUDA0 ]: Kcur_normed-16 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #592 ( MUL_MAT): Vcur-16 ( 8K) [CUDA0 ]: blk.16.attn_v.weight ( 4M) [CUDA0 ] attn_norm-16 ( 15K) [CUDA0 ] node #594 ( CPY): k_cache_view-16 (cop ( 2K) [CUDA0 ]: Kcur-16 ( 8K) [CUDA0 ] k_cache_view-16 ( 2K) [CUDA0 ] node #596 ( CPY): v_cache_view-16 (cop ( 2K) [CUDA0 ]: Vcur-16 ( 8K) [CUDA0 ] v_cache_view-16 ( 2K) [CUDA0 ]
SPLIT #34: CPU # 3 inputs: [q-16 ( 16K)] [k-16 ( 544K)] [v-16 ( 544K)]
node #600 (FLASH_ATTN): node_600 ( 16K) [ CPU ]: CPU#q-16#0 ( 16K) [ NULL ] CPU#k-16#0 ( 544K) [ NULL ] CPU#v-16#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #35: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #602 ( MUL_MAT): kqv_out-16 ( 15K) [CUDA0 ]: blk.16.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #603 ( RMS_NORM): norm-16 ( 15K) [CUDA0 ]: kqv_out-16 ( 15K) [CUDA0 ] node #604 ( MUL): attn_post_norm-16 ( 15K) [CUDA0 ]: norm-16 ( 15K) [CUDA0 ] blk.16.post_attentio ( 15K) [CUDA0 ] node #605 ( ADD): sa_out-16 ( 15K) [CUDA0 ]: attn_post_norm-16 ( 15K) [CUDA0 ] l_out-15 ( 15K) [CUDA0 ] node #606 ( RMS_NORM): norm-16 ( 15K) [CUDA0 ]: sa_out-16 ( 15K) [CUDA0 ] node #607 ( MUL): ffn_norm-16 ( 15K) [CUDA0 ]: norm-16 ( 15K) [CUDA0 ] blk.16.ffn_norm.weig ( 15K) [CUDA0 ] node #608 ( MUL_MAT): ffn_gate-16 ( 60K) [CUDA0 ]: blk.16.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-16 ( 15K) [CUDA0 ] node #609 ( UNARY): ffn_gelu-16 ( 60K) [CUDA0 ]: ffn_gate-16 ( 60K) [CUDA0 ] node #610 ( MUL_MAT): ffn_up-16 ( 60K) [CUDA0 ]: blk.16.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-16 ( 15K) [CUDA0 ] node #611 ( MUL): ffn_gate_par-16 ( 60K) [CUDA0 ]: ffn_gelu-16 ( 60K) [CUDA0 ] ffn_up-16 ( 60K) [CUDA0 ] node #612 ( MUL_MAT): ffn_out-16 ( 15K) [CUDA0 ]: blk.16.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-16 ( 60K) [CUDA0 ] node #613 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-16 ( 15K) [CUDA0 ] node #614 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.16.post_ffw_norm ( 15K) [CUDA0 ] node #615 ( ADD): l_out-16 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-16 ( 15K) [CUDA0 ] node #616 ( RMS_NORM): norm-17 ( 15K) [CUDA0 ]: l_out-16 ( 15K) [CUDA0 ] node #617 ( MUL): attn_norm-17 ( 15K) [CUDA0 ]: norm-17 ( 15K) [CUDA0 ] blk.17.attn_norm.wei ( 15K) [CUDA0 ] node #618 ( MUL_MAT): Qcur-17 ( 16K) [CUDA0 ]: blk.17.attn_q.weight ( 8M) [CUDA0 ] attn_norm-17 ( 15K) [CUDA0 ] node #620 ( RMS_NORM): norm-17 ( 16K) [CUDA0 ]: Qcur-17 (reshaped) ( 16K) [CUDA0 ] node #621 ( MUL): Qcur_normed-17 ( 16K) [CUDA0 ]: norm-17 ( 16K) [CUDA0 ] blk.17.attn_q_norm.w ( 1K) [CUDA0 ] node #622 ( ROPE): Qcur-17 ( 16K) [CUDA0 ]: Qcur_normed-17 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #623 ( MUL_MAT): Kcur-17 ( 8K) [CUDA0 ]: blk.17.attn_k.weight ( 4M) [CUDA0 ] attn_norm-17 ( 15K) [CUDA0 ] node #625 ( RMS_NORM): norm-17 ( 8K) [CUDA0 ]: Kcur-17 (reshaped) ( 8K) [CUDA0 ] node #626 ( MUL): Kcur_normed-17 ( 8K) [CUDA0 ]: norm-17 ( 8K) [CUDA0 ] blk.17.attn_k_norm.w ( 1K) [CUDA0 ] node #627 ( ROPE): Kcur-17 ( 8K) [CUDA0 ]: Kcur_normed-17 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #628 ( MUL_MAT): Vcur-17 ( 8K) [CUDA0 ]: blk.17.attn_v.weight ( 6M) [CUDA0 ] attn_norm-17 ( 15K) [CUDA0 ] node #630 ( CPY): k_cache_view-17 (cop ( 2K) [CUDA0 ]: Kcur-17 ( 8K) [CUDA0 ] k_cache_view-17 ( 2K) [CUDA0 ] node #632 ( CPY): v_cache_view-17 (cop ( 2K) [CUDA0 ]: Vcur-17 ( 8K) [CUDA0 ] v_cache_view-17 ( 2K) [CUDA0 ]
SPLIT #36: CPU # 3 inputs: [q-17 ( 16K)] [k-17 ( 544K)] [v-17 ( 544K)]
node #636 (FLASH_ATTN): node_636 ( 16K) [ CPU ]: CPU#q-17#0 ( 16K) [ NULL ] CPU#k-17#0 ( 544K) [ NULL ] CPU#v-17#0 ( 544K) [ NULL ] CPU#KQ_mask (copy)#0 ( 32K) [ NULL ]
SPLIT #37: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #638 ( MUL_MAT): kqv_out-17 ( 15K) [CUDA0 ]: blk.17.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #639 ( RMS_NORM): norm-17 ( 15K) [CUDA0 ]: kqv_out-17 ( 15K) [CUDA0 ] node #640 ( MUL): attn_post_norm-17 ( 15K) [CUDA0 ]: norm-17 ( 15K) [CUDA0 ] blk.17.post_attentio ( 15K) [CUDA0 ] node #641 ( ADD): sa_out-17 ( 15K) [CUDA0 ]: attn_post_norm-17 ( 15K) [CUDA0 ] l_out-16 ( 15K) [CUDA0 ] node #642 ( RMS_NORM): norm-17 ( 15K) [CUDA0 ]: sa_out-17 ( 15K) [CUDA0 ] node #643 ( MUL): ffn_norm-17 ( 15K) [CUDA0 ]: norm-17 ( 15K) [CUDA0 ] blk.17.ffn_norm.weig ( 15K) [CUDA0 ] node #644 ( MUL_MAT): ffn_gate-17 ( 60K) [CUDA0 ]: blk.17.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-17 ( 15K) [CUDA0 ] node #645 ( UNARY): ffn_gelu-17 ( 60K) [CUDA0 ]: ffn_gate-17 ( 60K) [CUDA0 ] node #646 ( MUL_MAT): ffn_up-17 ( 60K) [CUDA0 ]: blk.17.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-17 ( 15K) [CUDA0 ] node #647 ( MUL): ffn_gate_par-17 ( 60K) [CUDA0 ]: ffn_gelu-17 ( 60K) [CUDA0 ] ffn_up-17 ( 60K) [CUDA0 ] node #648 ( MUL_MAT): ffn_out-17 ( 15K) [CUDA0 ]: blk.17.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-17 ( 60K) [CUDA0 ] node #649 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-17 ( 15K) [CUDA0 ] node #650 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.17.post_ffw_norm ( 15K) [CUDA0 ] node #651 ( ADD): l_out-17 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-17 ( 15K) [CUDA0 ] node #652 ( RMS_NORM): norm-18 ( 15K) [CUDA0 ]: l_out-17 ( 15K) [CUDA0 ] node #653 ( MUL): attn_norm-18 ( 15K) [CUDA0 ]: norm-18 ( 15K) [CUDA0 ] blk.18.attn_norm.wei ( 15K) [CUDA0 ] node #654 ( MUL_MAT): Qcur-18 ( 16K) [CUDA0 ]: blk.18.attn_q.weight ( 8M) [CUDA0 ] attn_norm-18 ( 15K) [CUDA0 ] node #656 ( RMS_NORM): norm-18 ( 16K) [CUDA0 ]: Qcur-18 (reshaped) ( 16K) [CUDA0 ] node #657 ( MUL): Qcur_normed-18 ( 16K) [CUDA0 ]: norm-18 ( 16K) [CUDA0 ] blk.18.attn_q_norm.w ( 1K) [CUDA0 ] node #658 ( ROPE): Qcur-18 ( 16K) [CUDA0 ]: Qcur_normed-18 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #659 ( MUL_MAT): Kcur-18 ( 8K) [CUDA0 ]: blk.18.attn_k.weight ( 4M) [CUDA0 ] attn_norm-18 ( 15K) [CUDA0 ] node #661 ( RMS_NORM): norm-18 ( 8K) [CUDA0 ]: Kcur-18 (reshaped) ( 8K) [CUDA0 ] node #662 ( MUL): Kcur_normed-18 ( 8K) [CUDA0 ]: norm-18 ( 8K) [CUDA0 ] blk.18.attn_k_norm.w ( 1K) [CUDA0 ] node #663 ( ROPE): Kcur-18 ( 8K) [CUDA0 ]: Kcur_normed-18 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #664 ( MUL_MAT): Vcur-18 ( 8K) [CUDA0 ]: blk.18.attn_v.weight ( 4M) [CUDA0 ] attn_norm-18 ( 15K) [CUDA0 ] node #666 ( CPY): k_cache_view-18 (cop ( 2K) [CUDA0 ]: Kcur-18 ( 8K) [CUDA0 ] k_cache_view-18 ( 2K) [CUDA0 ] node #668 ( CPY): v_cache_view-18 (cop ( 2K) [CUDA0 ]: Vcur-18 ( 8K) [CUDA0 ] v_cache_view-18 ( 2K) [CUDA0 ]
SPLIT #38: CPU # 3 inputs: [q-18 ( 16K)] [k-18 ( 544K)] [v-18 ( 544K)]
node #672 (FLASH_ATTN): node_672 ( 16K) [ CPU ]: CPU#q-18#0 ( 16K) [ NULL ] CPU#k-18#0 ( 544K) [ NULL ] CPU#v-18#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #39: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #674 ( MUL_MAT): kqv_out-18 ( 15K) [CUDA0 ]: blk.18.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #675 ( RMS_NORM): norm-18 ( 15K) [CUDA0 ]: kqv_out-18 ( 15K) [CUDA0 ] node #676 ( MUL): attn_post_norm-18 ( 15K) [CUDA0 ]: norm-18 ( 15K) [CUDA0 ] blk.18.post_attentio ( 15K) [CUDA0 ] node #677 ( ADD): sa_out-18 ( 15K) [CUDA0 ]: attn_post_norm-18 ( 15K) [CUDA0 ] l_out-17 ( 15K) [CUDA0 ] node #678 ( RMS_NORM): norm-18 ( 15K) [CUDA0 ]: sa_out-18 ( 15K) [CUDA0 ] node #679 ( MUL): ffn_norm-18 ( 15K) [CUDA0 ]: norm-18 ( 15K) [CUDA0 ] blk.18.ffn_norm.weig ( 15K) [CUDA0 ] node #680 ( MUL_MAT): ffn_gate-18 ( 60K) [CUDA0 ]: blk.18.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-18 ( 15K) [CUDA0 ] node #681 ( UNARY): ffn_gelu-18 ( 60K) [CUDA0 ]: ffn_gate-18 ( 60K) [CUDA0 ] node #682 ( MUL_MAT): ffn_up-18 ( 60K) [CUDA0 ]: blk.18.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-18 ( 15K) [CUDA0 ] node #683 ( MUL): ffn_gate_par-18 ( 60K) [CUDA0 ]: ffn_gelu-18 ( 60K) [CUDA0 ] ffn_up-18 ( 60K) [CUDA0 ] node #684 ( MUL_MAT): ffn_out-18 ( 15K) [CUDA0 ]: blk.18.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-18 ( 60K) [CUDA0 ] node #685 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-18 ( 15K) [CUDA0 ] node #686 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.18.post_ffw_norm ( 15K) [CUDA0 ] node #687 ( ADD): l_out-18 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-18 ( 15K) [CUDA0 ] node #688 ( RMS_NORM): norm-19 ( 15K) [CUDA0 ]: l_out-18 ( 15K) [CUDA0 ] node #689 ( MUL): attn_norm-19 ( 15K) [CUDA0 ]: norm-19 ( 15K) [CUDA0 ] blk.19.attn_norm.wei ( 15K) [CUDA0 ] node #690 ( MUL_MAT): Qcur-19 ( 16K) [CUDA0 ]: blk.19.attn_q.weight ( 8M) [CUDA0 ] attn_norm-19 ( 15K) [CUDA0 ] node #692 ( RMS_NORM): norm-19 ( 16K) [CUDA0 ]: Qcur-19 (reshaped) ( 16K) [CUDA0 ] node #693 ( MUL): Qcur_normed-19 ( 16K) [CUDA0 ]: norm-19 ( 16K) [CUDA0 ] blk.19.attn_q_norm.w ( 1K) [CUDA0 ] node #694 ( ROPE): Qcur-19 ( 16K) [CUDA0 ]: Qcur_normed-19 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #695 ( MUL_MAT): Kcur-19 ( 8K) [CUDA0 ]: blk.19.attn_k.weight ( 4M) [CUDA0 ] attn_norm-19 ( 15K) [CUDA0 ] node #697 ( RMS_NORM): norm-19 ( 8K) [CUDA0 ]: Kcur-19 (reshaped) ( 8K) [CUDA0 ] node #698 ( MUL): Kcur_normed-19 ( 8K) [CUDA0 ]: norm-19 ( 8K) [CUDA0 ] blk.19.attn_k_norm.w ( 1K) [CUDA0 ] node #699 ( ROPE): Kcur-19 ( 8K) [CUDA0 ]: Kcur_normed-19 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #700 ( MUL_MAT): Vcur-19 ( 8K) [CUDA0 ]: blk.19.attn_v.weight ( 4M) [CUDA0 ] attn_norm-19 ( 15K) [CUDA0 ] node #702 ( CPY): k_cache_view-19 (cop ( 2K) [CUDA0 ]: Kcur-19 ( 8K) [CUDA0 ] k_cache_view-19 ( 2K) [CUDA0 ] node #704 ( CPY): v_cache_view-19 (cop ( 2K) [CUDA0 ]: Vcur-19 ( 8K) [CUDA0 ] v_cache_view-19 ( 2K) [CUDA0 ]
SPLIT #40: CPU # 3 inputs: [q-19 ( 16K)] [k-19 ( 544K)] [v-19 ( 544K)]
node #708 (FLASH_ATTN): node_708 ( 16K) [ CPU ]: CPU#q-19#0 ( 16K) [ NULL ] CPU#k-19#0 ( 544K) [ NULL ] CPU#v-19#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #41: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #710 ( MUL_MAT): kqv_out-19 ( 15K) [CUDA0 ]: blk.19.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #711 ( RMS_NORM): norm-19 ( 15K) [CUDA0 ]: kqv_out-19 ( 15K) [CUDA0 ] node #712 ( MUL): attn_post_norm-19 ( 15K) [CUDA0 ]: norm-19 ( 15K) [CUDA0 ] blk.19.post_attentio ( 15K) [CUDA0 ] node #713 ( ADD): sa_out-19 ( 15K) [CUDA0 ]: attn_post_norm-19 ( 15K) [CUDA0 ] l_out-18 ( 15K) [CUDA0 ] node #714 ( RMS_NORM): norm-19 ( 15K) [CUDA0 ]: sa_out-19 ( 15K) [CUDA0 ] node #715 ( MUL): ffn_norm-19 ( 15K) [CUDA0 ]: norm-19 ( 15K) [CUDA0 ] blk.19.ffn_norm.weig ( 15K) [CUDA0 ] node #716 ( MUL_MAT): ffn_gate-19 ( 60K) [CUDA0 ]: blk.19.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-19 ( 15K) [CUDA0 ] node #717 ( UNARY): ffn_gelu-19 ( 60K) [CUDA0 ]: ffn_gate-19 ( 60K) [CUDA0 ] node #718 ( MUL_MAT): ffn_up-19 ( 60K) [CUDA0 ]: blk.19.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-19 ( 15K) [CUDA0 ] node #719 ( MUL): ffn_gate_par-19 ( 60K) [CUDA0 ]: ffn_gelu-19 ( 60K) [CUDA0 ] ffn_up-19 ( 60K) [CUDA0 ] node #720 ( MUL_MAT): ffn_out-19 ( 15K) [CUDA0 ]: blk.19.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-19 ( 60K) [CUDA0 ] node #721 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-19 ( 15K) [CUDA0 ] node #722 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.19.post_ffw_norm ( 15K) [CUDA0 ] node #723 ( ADD): l_out-19 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-19 ( 15K) [CUDA0 ] node #724 ( RMS_NORM): norm-20 ( 15K) [CUDA0 ]: l_out-19 ( 15K) [CUDA0 ] node #725 ( MUL): attn_norm-20 ( 15K) [CUDA0 ]: norm-20 ( 15K) [CUDA0 ] blk.20.attn_norm.wei ( 15K) [CUDA0 ] node #726 ( MUL_MAT): Qcur-20 ( 16K) [CUDA0 ]: blk.20.attn_q.weight ( 8M) [CUDA0 ] attn_norm-20 ( 15K) [CUDA0 ] node #728 ( RMS_NORM): norm-20 ( 16K) [CUDA0 ]: Qcur-20 (reshaped) ( 16K) [CUDA0 ] node #729 ( MUL): Qcur_normed-20 ( 16K) [CUDA0 ]: norm-20 ( 16K) [CUDA0 ] blk.20.attn_q_norm.w ( 1K) [CUDA0 ] node #730 ( ROPE): Qcur-20 ( 16K) [CUDA0 ]: Qcur_normed-20 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #731 ( MUL_MAT): Kcur-20 ( 8K) [CUDA0 ]: blk.20.attn_k.weight ( 4M) [CUDA0 ] attn_norm-20 ( 15K) [CUDA0 ] node #733 ( RMS_NORM): norm-20 ( 8K) [CUDA0 ]: Kcur-20 (reshaped) ( 8K) [CUDA0 ] node #734 ( MUL): Kcur_normed-20 ( 8K) [CUDA0 ]: norm-20 ( 8K) [CUDA0 ] blk.20.attn_k_norm.w ( 1K) [CUDA0 ] node #735 ( ROPE): Kcur-20 ( 8K) [CUDA0 ]: Kcur_normed-20 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #736 ( MUL_MAT): Vcur-20 ( 8K) [CUDA0 ]: blk.20.attn_v.weight ( 6M) [CUDA0 ] attn_norm-20 ( 15K) [CUDA0 ] node #738 ( CPY): k_cache_view-20 (cop ( 2K) [CUDA0 ]: Kcur-20 ( 8K) [CUDA0 ] k_cache_view-20 ( 2K) [CUDA0 ] node #740 ( CPY): v_cache_view-20 (cop ( 2K) [CUDA0 ]: Vcur-20 ( 8K) [CUDA0 ] v_cache_view-20 ( 2K) [CUDA0 ]
SPLIT #42: CPU # 3 inputs: [q-20 ( 16K)] [k-20 ( 544K)] [v-20 ( 544K)]
node #744 (FLASH_ATTN): node_744 ( 16K) [ CPU ]: CPU#q-20#0 ( 16K) [ NULL ] CPU#k-20#0 ( 544K) [ NULL ] CPU#v-20#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #43: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #746 ( MUL_MAT): kqv_out-20 ( 15K) [CUDA0 ]: blk.20.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #747 ( RMS_NORM): norm-20 ( 15K) [CUDA0 ]: kqv_out-20 ( 15K) [CUDA0 ] node #748 ( MUL): attn_post_norm-20 ( 15K) [CUDA0 ]: norm-20 ( 15K) [CUDA0 ] blk.20.post_attentio ( 15K) [CUDA0 ] node #749 ( ADD): sa_out-20 ( 15K) [CUDA0 ]: attn_post_norm-20 ( 15K) [CUDA0 ] l_out-19 ( 15K) [CUDA0 ] node #750 ( RMS_NORM): norm-20 ( 15K) [CUDA0 ]: sa_out-20 ( 15K) [CUDA0 ] node #751 ( MUL): ffn_norm-20 ( 15K) [CUDA0 ]: norm-20 ( 15K) [CUDA0 ] blk.20.ffn_norm.weig ( 15K) [CUDA0 ] node #752 ( MUL_MAT): ffn_gate-20 ( 60K) [CUDA0 ]: blk.20.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-20 ( 15K) [CUDA0 ] node #753 ( UNARY): ffn_gelu-20 ( 60K) [CUDA0 ]: ffn_gate-20 ( 60K) [CUDA0 ] node #754 ( MUL_MAT): ffn_up-20 ( 60K) [CUDA0 ]: blk.20.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-20 ( 15K) [CUDA0 ] node #755 ( MUL): ffn_gate_par-20 ( 60K) [CUDA0 ]: ffn_gelu-20 ( 60K) [CUDA0 ] ffn_up-20 ( 60K) [CUDA0 ] node #756 ( MUL_MAT): ffn_out-20 ( 15K) [CUDA0 ]: blk.20.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-20 ( 60K) [CUDA0 ] node #757 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-20 ( 15K) [CUDA0 ] node #758 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.20.post_ffw_norm ( 15K) [CUDA0 ] node #759 ( ADD): l_out-20 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-20 ( 15K) [CUDA0 ] node #760 ( RMS_NORM): norm-21 ( 15K) [CUDA0 ]: l_out-20 ( 15K) [CUDA0 ] node #761 ( MUL): attn_norm-21 ( 15K) [CUDA0 ]: norm-21 ( 15K) [CUDA0 ] blk.21.attn_norm.wei ( 15K) [CUDA0 ] node #762 ( MUL_MAT): Qcur-21 ( 16K) [CUDA0 ]: blk.21.attn_q.weight ( 8M) [CUDA0 ] attn_norm-21 ( 15K) [CUDA0 ] node #764 ( RMS_NORM): norm-21 ( 16K) [CUDA0 ]: Qcur-21 (reshaped) ( 16K) [CUDA0 ] node #765 ( MUL): Qcur_normed-21 ( 16K) [CUDA0 ]: norm-21 ( 16K) [CUDA0 ] blk.21.attn_q_norm.w ( 1K) [CUDA0 ] node #766 ( ROPE): Qcur-21 ( 16K) [CUDA0 ]: Qcur_normed-21 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #767 ( MUL_MAT): Kcur-21 ( 8K) [CUDA0 ]: blk.21.attn_k.weight ( 4M) [CUDA0 ] attn_norm-21 ( 15K) [CUDA0 ] node #769 ( RMS_NORM): norm-21 ( 8K) [CUDA0 ]: Kcur-21 (reshaped) ( 8K) [CUDA0 ] node #770 ( MUL): Kcur_normed-21 ( 8K) [CUDA0 ]: norm-21 ( 8K) [CUDA0 ] blk.21.attn_k_norm.w ( 1K) [CUDA0 ] node #771 ( ROPE): Kcur-21 ( 8K) [CUDA0 ]: Kcur_normed-21 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #772 ( MUL_MAT): Vcur-21 ( 8K) [CUDA0 ]: blk.21.attn_v.weight ( 4M) [CUDA0 ] attn_norm-21 ( 15K) [CUDA0 ] node #774 ( CPY): k_cache_view-21 (cop ( 2K) [CUDA0 ]: Kcur-21 ( 8K) [CUDA0 ] k_cache_view-21 ( 2K) [CUDA0 ] node #776 ( CPY): v_cache_view-21 (cop ( 2K) [CUDA0 ]: Vcur-21 ( 8K) [CUDA0 ] v_cache_view-21 ( 2K) [CUDA0 ]
SPLIT #44: CPU # 3 inputs: [q-21 ( 16K)] [k-21 ( 544K)] [v-21 ( 544K)]
node #780 (FLASH_ATTN): node_780 ( 16K) [ CPU ]: CPU#q-21#0 ( 16K) [ NULL ] CPU#k-21#0 ( 544K) [ NULL ] CPU#v-21#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #45: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #782 ( MUL_MAT): kqv_out-21 ( 15K) [CUDA0 ]: blk.21.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #783 ( RMS_NORM): norm-21 ( 15K) [CUDA0 ]: kqv_out-21 ( 15K) [CUDA0 ] node #784 ( MUL): attn_post_norm-21 ( 15K) [CUDA0 ]: norm-21 ( 15K) [CUDA0 ] blk.21.post_attentio ( 15K) [CUDA0 ] node #785 ( ADD): sa_out-21 ( 15K) [CUDA0 ]: attn_post_norm-21 ( 15K) [CUDA0 ] l_out-20 ( 15K) [CUDA0 ] node #786 ( RMS_NORM): norm-21 ( 15K) [CUDA0 ]: sa_out-21 ( 15K) [CUDA0 ] node #787 ( MUL): ffn_norm-21 ( 15K) [CUDA0 ]: norm-21 ( 15K) [CUDA0 ] blk.21.ffn_norm.weig ( 15K) [CUDA0 ] node #788 ( MUL_MAT): ffn_gate-21 ( 60K) [CUDA0 ]: blk.21.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-21 ( 15K) [CUDA0 ] node #789 ( UNARY): ffn_gelu-21 ( 60K) [CUDA0 ]: ffn_gate-21 ( 60K) [CUDA0 ] node #790 ( MUL_MAT): ffn_up-21 ( 60K) [CUDA0 ]: blk.21.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-21 ( 15K) [CUDA0 ] node #791 ( MUL): ffn_gate_par-21 ( 60K) [CUDA0 ]: ffn_gelu-21 ( 60K) [CUDA0 ] ffn_up-21 ( 60K) [CUDA0 ] node #792 ( MUL_MAT): ffn_out-21 ( 15K) [CUDA0 ]: blk.21.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-21 ( 60K) [CUDA0 ] node #793 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-21 ( 15K) [CUDA0 ] node #794 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.21.post_ffw_norm ( 15K) [CUDA0 ] node #795 ( ADD): l_out-21 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-21 ( 15K) [CUDA0 ] node #796 ( RMS_NORM): norm-22 ( 15K) [CUDA0 ]: l_out-21 ( 15K) [CUDA0 ] node #797 ( MUL): attn_norm-22 ( 15K) [CUDA0 ]: norm-22 ( 15K) [CUDA0 ] blk.22.attn_norm.wei ( 15K) [CUDA0 ] node #798 ( MUL_MAT): Qcur-22 ( 16K) [CUDA0 ]: blk.22.attn_q.weight ( 8M) [CUDA0 ] attn_norm-22 ( 15K) [CUDA0 ] node #800 ( RMS_NORM): norm-22 ( 16K) [CUDA0 ]: Qcur-22 (reshaped) ( 16K) [CUDA0 ] node #801 ( MUL): Qcur_normed-22 ( 16K) [CUDA0 ]: norm-22 ( 16K) [CUDA0 ] blk.22.attn_q_norm.w ( 1K) [CUDA0 ] node #802 ( ROPE): Qcur-22 ( 16K) [CUDA0 ]: Qcur_normed-22 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #803 ( MUL_MAT): Kcur-22 ( 8K) [CUDA0 ]: blk.22.attn_k.weight ( 4M) [CUDA0 ] attn_norm-22 ( 15K) [CUDA0 ] node #805 ( RMS_NORM): norm-22 ( 8K) [CUDA0 ]: Kcur-22 (reshaped) ( 8K) [CUDA0 ] node #806 ( MUL): Kcur_normed-22 ( 8K) [CUDA0 ]: norm-22 ( 8K) [CUDA0 ] blk.22.attn_k_norm.w ( 1K) [CUDA0 ] node #807 ( ROPE): Kcur-22 ( 8K) [CUDA0 ]: Kcur_normed-22 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #808 ( MUL_MAT): Vcur-22 ( 8K) [CUDA0 ]: blk.22.attn_v.weight ( 4M) [CUDA0 ] attn_norm-22 ( 15K) [CUDA0 ] node #810 ( CPY): k_cache_view-22 (cop ( 2K) [CUDA0 ]: Kcur-22 ( 8K) [CUDA0 ] k_cache_view-22 ( 2K) [CUDA0 ] node #812 ( CPY): v_cache_view-22 (cop ( 2K) [CUDA0 ]: Vcur-22 ( 8K) [CUDA0 ] v_cache_view-22 ( 2K) [CUDA0 ]
SPLIT #46: CPU # 3 inputs: [q-22 ( 16K)] [k-22 ( 544K)] [v-22 ( 544K)]
node #816 (FLASH_ATTN): node_816 ( 16K) [ CPU ]: CPU#q-22#0 ( 16K) [ NULL ] CPU#k-22#0 ( 544K) [ NULL ] CPU#v-22#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #47: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #818 ( MUL_MAT): kqv_out-22 ( 15K) [CUDA0 ]: blk.22.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #819 ( RMS_NORM): norm-22 ( 15K) [CUDA0 ]: kqv_out-22 ( 15K) [CUDA0 ] node #820 ( MUL): attn_post_norm-22 ( 15K) [CUDA0 ]: norm-22 ( 15K) [CUDA0 ] blk.22.post_attentio ( 15K) [CUDA0 ] node #821 ( ADD): sa_out-22 ( 15K) [CUDA0 ]: attn_post_norm-22 ( 15K) [CUDA0 ] l_out-21 ( 15K) [CUDA0 ] node #822 ( RMS_NORM): norm-22 ( 15K) [CUDA0 ]: sa_out-22 ( 15K) [CUDA0 ] node #823 ( MUL): ffn_norm-22 ( 15K) [CUDA0 ]: norm-22 ( 15K) [CUDA0 ] blk.22.ffn_norm.weig ( 15K) [CUDA0 ] node #824 ( MUL_MAT): ffn_gate-22 ( 60K) [CUDA0 ]: blk.22.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-22 ( 15K) [CUDA0 ] node #825 ( UNARY): ffn_gelu-22 ( 60K) [CUDA0 ]: ffn_gate-22 ( 60K) [CUDA0 ] node #826 ( MUL_MAT): ffn_up-22 ( 60K) [CUDA0 ]: blk.22.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-22 ( 15K) [CUDA0 ] node #827 ( MUL): ffn_gate_par-22 ( 60K) [CUDA0 ]: ffn_gelu-22 ( 60K) [CUDA0 ] ffn_up-22 ( 60K) [CUDA0 ] node #828 ( MUL_MAT): ffn_out-22 ( 15K) [CUDA0 ]: blk.22.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-22 ( 60K) [CUDA0 ] node #829 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-22 ( 15K) [CUDA0 ] node #830 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.22.post_ffw_norm ( 15K) [CUDA0 ] node #831 ( ADD): l_out-22 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-22 ( 15K) [CUDA0 ] node #832 ( RMS_NORM): norm-23 ( 15K) [CUDA0 ]: l_out-22 ( 15K) [CUDA0 ] node #833 ( MUL): attn_norm-23 ( 15K) [CUDA0 ]: norm-23 ( 15K) [CUDA0 ] blk.23.attn_norm.wei ( 15K) [CUDA0 ] node #834 ( MUL_MAT): Qcur-23 ( 16K) [CUDA0 ]: blk.23.attn_q.weight ( 8M) [CUDA0 ] attn_norm-23 ( 15K) [CUDA0 ] node #836 ( RMS_NORM): norm-23 ( 16K) [CUDA0 ]: Qcur-23 (reshaped) ( 16K) [CUDA0 ] node #837 ( MUL): Qcur_normed-23 ( 16K) [CUDA0 ]: norm-23 ( 16K) [CUDA0 ] blk.23.attn_q_norm.w ( 1K) [CUDA0 ] node #838 ( ROPE): Qcur-23 ( 16K) [CUDA0 ]: Qcur_normed-23 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #839 ( MUL_MAT): Kcur-23 ( 8K) [CUDA0 ]: blk.23.attn_k.weight ( 4M) [CUDA0 ] attn_norm-23 ( 15K) [CUDA0 ] node #841 ( RMS_NORM): norm-23 ( 8K) [CUDA0 ]: Kcur-23 (reshaped) ( 8K) [CUDA0 ] node #842 ( MUL): Kcur_normed-23 ( 8K) [CUDA0 ]: norm-23 ( 8K) [CUDA0 ] blk.23.attn_k_norm.w ( 1K) [CUDA0 ] node #843 ( ROPE): Kcur-23 ( 8K) [CUDA0 ]: Kcur_normed-23 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #844 ( MUL_MAT): Vcur-23 ( 8K) [CUDA0 ]: blk.23.attn_v.weight ( 6M) [CUDA0 ] attn_norm-23 ( 15K) [CUDA0 ] node #846 ( CPY): k_cache_view-23 (cop ( 2K) [CUDA0 ]: Kcur-23 ( 8K) [CUDA0 ] k_cache_view-23 ( 2K) [CUDA0 ] node #848 ( CPY): v_cache_view-23 (cop ( 2K) [CUDA0 ]: Vcur-23 ( 8K) [CUDA0 ] v_cache_view-23 ( 2K) [CUDA0 ]
SPLIT #48: CPU # 3 inputs: [q-23 ( 16K)] [k-23 ( 544K)] [v-23 ( 544K)]
node #852 (FLASH_ATTN): node_852 ( 16K) [ CPU ]: CPU#q-23#0 ( 16K) [ NULL ] CPU#k-23#0 ( 544K) [ NULL ] CPU#v-23#0 ( 544K) [ NULL ] CPU#KQ_mask (copy)#0 ( 32K) [ NULL ]
SPLIT #49: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #854 ( MUL_MAT): kqv_out-23 ( 15K) [CUDA0 ]: blk.23.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #855 ( RMS_NORM): norm-23 ( 15K) [CUDA0 ]: kqv_out-23 ( 15K) [CUDA0 ] node #856 ( MUL): attn_post_norm-23 ( 15K) [CUDA0 ]: norm-23 ( 15K) [CUDA0 ] blk.23.post_attentio ( 15K) [CUDA0 ] node #857 ( ADD): sa_out-23 ( 15K) [CUDA0 ]: attn_post_norm-23 ( 15K) [CUDA0 ] l_out-22 ( 15K) [CUDA0 ] node #858 ( RMS_NORM): norm-23 ( 15K) [CUDA0 ]: sa_out-23 ( 15K) [CUDA0 ] node #859 ( MUL): ffn_norm-23 ( 15K) [CUDA0 ]: norm-23 ( 15K) [CUDA0 ] blk.23.ffn_norm.weig ( 15K) [CUDA0 ] node #860 ( MUL_MAT): ffn_gate-23 ( 60K) [CUDA0 ]: blk.23.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-23 ( 15K) [CUDA0 ] node #861 ( UNARY): ffn_gelu-23 ( 60K) [CUDA0 ]: ffn_gate-23 ( 60K) [CUDA0 ] node #862 ( MUL_MAT): ffn_up-23 ( 60K) [CUDA0 ]: blk.23.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-23 ( 15K) [CUDA0 ] node #863 ( MUL): ffn_gate_par-23 ( 60K) [CUDA0 ]: ffn_gelu-23 ( 60K) [CUDA0 ] ffn_up-23 ( 60K) [CUDA0 ] node #864 ( MUL_MAT): ffn_out-23 ( 15K) [CUDA0 ]: blk.23.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-23 ( 60K) [CUDA0 ] node #865 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-23 ( 15K) [CUDA0 ] node #866 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.23.post_ffw_norm ( 15K) [CUDA0 ] node #867 ( ADD): l_out-23 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-23 ( 15K) [CUDA0 ] node #868 ( RMS_NORM): norm-24 ( 15K) [CUDA0 ]: l_out-23 ( 15K) [CUDA0 ] node #869 ( MUL): attn_norm-24 ( 15K) [CUDA0 ]: norm-24 ( 15K) [CUDA0 ] blk.24.attn_norm.wei ( 15K) [CUDA0 ] node #870 ( MUL_MAT): Qcur-24 ( 16K) [CUDA0 ]: blk.24.attn_q.weight ( 8M) [CUDA0 ] attn_norm-24 ( 15K) [CUDA0 ] node #872 ( RMS_NORM): norm-24 ( 16K) [CUDA0 ]: Qcur-24 (reshaped) ( 16K) [CUDA0 ] node #873 ( MUL): Qcur_normed-24 ( 16K) [CUDA0 ]: norm-24 ( 16K) [CUDA0 ] blk.24.attn_q_norm.w ( 1K) [CUDA0 ] node #874 ( ROPE): Qcur-24 ( 16K) [CUDA0 ]: Qcur_normed-24 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #875 ( MUL_MAT): Kcur-24 ( 8K) [CUDA0 ]: blk.24.attn_k.weight ( 4M) [CUDA0 ] attn_norm-24 ( 15K) [CUDA0 ] node #877 ( RMS_NORM): norm-24 ( 8K) [CUDA0 ]: Kcur-24 (reshaped) ( 8K) [CUDA0 ] node #878 ( MUL): Kcur_normed-24 ( 8K) [CUDA0 ]: norm-24 ( 8K) [CUDA0 ] blk.24.attn_k_norm.w ( 1K) [CUDA0 ] node #879 ( ROPE): Kcur-24 ( 8K) [CUDA0 ]: Kcur_normed-24 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #880 ( MUL_MAT): Vcur-24 ( 8K) [CUDA0 ]: blk.24.attn_v.weight ( 4M) [CUDA0 ] attn_norm-24 ( 15K) [CUDA0 ] node #882 ( CPY): k_cache_view-24 (cop ( 2K) [CUDA0 ]: Kcur-24 ( 8K) [CUDA0 ] k_cache_view-24 ( 2K) [CUDA0 ] node #884 ( CPY): v_cache_view-24 (cop ( 2K) [CUDA0 ]: Vcur-24 ( 8K) [CUDA0 ] v_cache_view-24 ( 2K) [CUDA0 ]
SPLIT #50: CPU # 3 inputs: [q-24 ( 16K)] [k-24 ( 544K)] [v-24 ( 544K)]
node #888 (FLASH_ATTN): node_888 ( 16K) [ CPU ]: CPU#q-24#0 ( 16K) [ NULL ] CPU#k-24#0 ( 544K) [ NULL ] CPU#v-24#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #51: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #890 ( MUL_MAT): kqv_out-24 ( 15K) [CUDA0 ]: blk.24.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #891 ( RMS_NORM): norm-24 ( 15K) [CUDA0 ]: kqv_out-24 ( 15K) [CUDA0 ] node #892 ( MUL): attn_post_norm-24 ( 15K) [CUDA0 ]: norm-24 ( 15K) [CUDA0 ] blk.24.post_attentio ( 15K) [CUDA0 ] node #893 ( ADD): sa_out-24 ( 15K) [CUDA0 ]: attn_post_norm-24 ( 15K) [CUDA0 ] l_out-23 ( 15K) [CUDA0 ] node #894 ( RMS_NORM): norm-24 ( 15K) [CUDA0 ]: sa_out-24 ( 15K) [CUDA0 ] node #895 ( MUL): ffn_norm-24 ( 15K) [CUDA0 ]: norm-24 ( 15K) [CUDA0 ] blk.24.ffn_norm.weig ( 15K) [CUDA0 ] node #896 ( MUL_MAT): ffn_gate-24 ( 60K) [CUDA0 ]: blk.24.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-24 ( 15K) [CUDA0 ] node #897 ( UNARY): ffn_gelu-24 ( 60K) [CUDA0 ]: ffn_gate-24 ( 60K) [CUDA0 ] node #898 ( MUL_MAT): ffn_up-24 ( 60K) [CUDA0 ]: blk.24.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-24 ( 15K) [CUDA0 ] node #899 ( MUL): ffn_gate_par-24 ( 60K) [CUDA0 ]: ffn_gelu-24 ( 60K) [CUDA0 ] ffn_up-24 ( 60K) [CUDA0 ] node #900 ( MUL_MAT): ffn_out-24 ( 15K) [CUDA0 ]: blk.24.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-24 ( 60K) [CUDA0 ] node #901 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-24 ( 15K) [CUDA0 ] node #902 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.24.post_ffw_norm ( 15K) [CUDA0 ] node #903 ( ADD): l_out-24 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-24 ( 15K) [CUDA0 ] node #904 ( RMS_NORM): norm-25 ( 15K) [CUDA0 ]: l_out-24 ( 15K) [CUDA0 ] node #905 ( MUL): attn_norm-25 ( 15K) [CUDA0 ]: norm-25 ( 15K) [CUDA0 ] blk.25.attn_norm.wei ( 15K) [CUDA0 ] node #906 ( MUL_MAT): Qcur-25 ( 16K) [CUDA0 ]: blk.25.attn_q.weight ( 8M) [CUDA0 ] attn_norm-25 ( 15K) [CUDA0 ] node #908 ( RMS_NORM): norm-25 ( 16K) [CUDA0 ]: Qcur-25 (reshaped) ( 16K) [CUDA0 ] node #909 ( MUL): Qcur_normed-25 ( 16K) [CUDA0 ]: norm-25 ( 16K) [CUDA0 ] blk.25.attn_q_norm.w ( 1K) [CUDA0 ] node #910 ( ROPE): Qcur-25 ( 16K) [CUDA0 ]: Qcur_normed-25 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #911 ( MUL_MAT): Kcur-25 ( 8K) [CUDA0 ]: blk.25.attn_k.weight ( 4M) [CUDA0 ] attn_norm-25 ( 15K) [CUDA0 ] node #913 ( RMS_NORM): norm-25 ( 8K) [CUDA0 ]: Kcur-25 (reshaped) ( 8K) [CUDA0 ] node #914 ( MUL): Kcur_normed-25 ( 8K) [CUDA0 ]: norm-25 ( 8K) [CUDA0 ] blk.25.attn_k_norm.w ( 1K) [CUDA0 ] node #915 ( ROPE): Kcur-25 ( 8K) [CUDA0 ]: Kcur_normed-25 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #916 ( MUL_MAT): Vcur-25 ( 8K) [CUDA0 ]: blk.25.attn_v.weight ( 4M) [CUDA0 ] attn_norm-25 ( 15K) [CUDA0 ] node #918 ( CPY): k_cache_view-25 (cop ( 2K) [CUDA0 ]: Kcur-25 ( 8K) [CUDA0 ] k_cache_view-25 ( 2K) [CUDA0 ] node #920 ( CPY): v_cache_view-25 (cop ( 2K) [CUDA0 ]: Vcur-25 ( 8K) [CUDA0 ] v_cache_view-25 ( 2K) [CUDA0 ]
SPLIT #52: CPU # 3 inputs: [q-25 ( 16K)] [k-25 ( 544K)] [v-25 ( 544K)]
node #924 (FLASH_ATTN): node_924 ( 16K) [ CPU ]: CPU#q-25#0 ( 16K) [ NULL ] CPU#k-25#0 ( 544K) [ NULL ] CPU#v-25#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #53: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #926 ( MUL_MAT): kqv_out-25 ( 15K) [CUDA0 ]: blk.25.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #927 ( RMS_NORM): norm-25 ( 15K) [CUDA0 ]: kqv_out-25 ( 15K) [CUDA0 ] node #928 ( MUL): attn_post_norm-25 ( 15K) [CUDA0 ]: norm-25 ( 15K) [CUDA0 ] blk.25.post_attentio ( 15K) [CUDA0 ] node #929 ( ADD): sa_out-25 ( 15K) [CUDA0 ]: attn_post_norm-25 ( 15K) [CUDA0 ] l_out-24 ( 15K) [CUDA0 ] node #930 ( RMS_NORM): norm-25 ( 15K) [CUDA0 ]: sa_out-25 ( 15K) [CUDA0 ] node #931 ( MUL): ffn_norm-25 ( 15K) [CUDA0 ]: norm-25 ( 15K) [CUDA0 ] blk.25.ffn_norm.weig ( 15K) [CUDA0 ] node #932 ( MUL_MAT): ffn_gate-25 ( 60K) [CUDA0 ]: blk.25.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-25 ( 15K) [CUDA0 ] node #933 ( UNARY): ffn_gelu-25 ( 60K) [CUDA0 ]: ffn_gate-25 ( 60K) [CUDA0 ] node #934 ( MUL_MAT): ffn_up-25 ( 60K) [CUDA0 ]: blk.25.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-25 ( 15K) [CUDA0 ] node #935 ( MUL): ffn_gate_par-25 ( 60K) [CUDA0 ]: ffn_gelu-25 ( 60K) [CUDA0 ] ffn_up-25 ( 60K) [CUDA0 ] node #936 ( MUL_MAT): ffn_out-25 ( 15K) [CUDA0 ]: blk.25.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-25 ( 60K) [CUDA0 ] node #937 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-25 ( 15K) [CUDA0 ] node #938 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.25.post_ffw_norm ( 15K) [CUDA0 ] node #939 ( ADD): l_out-25 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-25 ( 15K) [CUDA0 ] node #940 ( RMS_NORM): norm-26 ( 15K) [CUDA0 ]: l_out-25 ( 15K) [CUDA0 ] node #941 ( MUL): attn_norm-26 ( 15K) [CUDA0 ]: norm-26 ( 15K) [CUDA0 ] blk.26.attn_norm.wei ( 15K) [CUDA0 ] node #942 ( MUL_MAT): Qcur-26 ( 16K) [CUDA0 ]: blk.26.attn_q.weight ( 8M) [CUDA0 ] attn_norm-26 ( 15K) [CUDA0 ] node #944 ( RMS_NORM): norm-26 ( 16K) [CUDA0 ]: Qcur-26 (reshaped) ( 16K) [CUDA0 ] node #945 ( MUL): Qcur_normed-26 ( 16K) [CUDA0 ]: norm-26 ( 16K) [CUDA0 ] blk.26.attn_q_norm.w ( 1K) [CUDA0 ] node #946 ( ROPE): Qcur-26 ( 16K) [CUDA0 ]: Qcur_normed-26 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #947 ( MUL_MAT): Kcur-26 ( 8K) [CUDA0 ]: blk.26.attn_k.weight ( 4M) [CUDA0 ] attn_norm-26 ( 15K) [CUDA0 ] node #949 ( RMS_NORM): norm-26 ( 8K) [CUDA0 ]: Kcur-26 (reshaped) ( 8K) [CUDA0 ] node #950 ( MUL): Kcur_normed-26 ( 8K) [CUDA0 ]: norm-26 ( 8K) [CUDA0 ] blk.26.attn_k_norm.w ( 1K) [CUDA0 ] node #951 ( ROPE): Kcur-26 ( 8K) [CUDA0 ]: Kcur_normed-26 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #952 ( MUL_MAT): Vcur-26 ( 8K) [CUDA0 ]: blk.26.attn_v.weight ( 6M) [CUDA0 ] attn_norm-26 ( 15K) [CUDA0 ] node #954 ( CPY): k_cache_view-26 (cop ( 2K) [CUDA0 ]: Kcur-26 ( 8K) [CUDA0 ] k_cache_view-26 ( 2K) [CUDA0 ] node #956 ( CPY): v_cache_view-26 (cop ( 2K) [CUDA0 ]: Vcur-26 ( 8K) [CUDA0 ] v_cache_view-26 ( 2K) [CUDA0 ]
SPLIT #54: CPU # 3 inputs: [q-26 ( 16K)] [k-26 ( 544K)] [v-26 ( 544K)]
node #960 (FLASH_ATTN): node_960 ( 16K) [ CPU ]: CPU#q-26#0 ( 16K) [ NULL ] CPU#k-26#0 ( 544K) [ NULL ] CPU#v-26#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #55: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #962 ( MUL_MAT): kqv_out-26 ( 15K) [CUDA0 ]: blk.26.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #963 ( RMS_NORM): norm-26 ( 15K) [CUDA0 ]: kqv_out-26 ( 15K) [CUDA0 ] node #964 ( MUL): attn_post_norm-26 ( 15K) [CUDA0 ]: norm-26 ( 15K) [CUDA0 ] blk.26.post_attentio ( 15K) [CUDA0 ] node #965 ( ADD): sa_out-26 ( 15K) [CUDA0 ]: attn_post_norm-26 ( 15K) [CUDA0 ] l_out-25 ( 15K) [CUDA0 ] node #966 ( RMS_NORM): norm-26 ( 15K) [CUDA0 ]: sa_out-26 ( 15K) [CUDA0 ] node #967 ( MUL): ffn_norm-26 ( 15K) [CUDA0 ]: norm-26 ( 15K) [CUDA0 ] blk.26.ffn_norm.weig ( 15K) [CUDA0 ] node #968 ( MUL_MAT): ffn_gate-26 ( 60K) [CUDA0 ]: blk.26.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-26 ( 15K) [CUDA0 ] node #969 ( UNARY): ffn_gelu-26 ( 60K) [CUDA0 ]: ffn_gate-26 ( 60K) [CUDA0 ] node #970 ( MUL_MAT): ffn_up-26 ( 60K) [CUDA0 ]: blk.26.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-26 ( 15K) [CUDA0 ] node #971 ( MUL): ffn_gate_par-26 ( 60K) [CUDA0 ]: ffn_gelu-26 ( 60K) [CUDA0 ] ffn_up-26 ( 60K) [CUDA0 ] node #972 ( MUL_MAT): ffn_out-26 ( 15K) [CUDA0 ]: blk.26.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-26 ( 60K) [CUDA0 ] node #973 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-26 ( 15K) [CUDA0 ] node #974 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.26.post_ffw_norm ( 15K) [CUDA0 ] node #975 ( ADD): l_out-26 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-26 ( 15K) [CUDA0 ] node #976 ( RMS_NORM): norm-27 ( 15K) [CUDA0 ]: l_out-26 ( 15K) [CUDA0 ] node #977 ( MUL): attn_norm-27 ( 15K) [CUDA0 ]: norm-27 ( 15K) [CUDA0 ] blk.27.attn_norm.wei ( 15K) [CUDA0 ] node #978 ( MUL_MAT): Qcur-27 ( 16K) [CUDA0 ]: blk.27.attn_q.weight ( 8M) [CUDA0 ] attn_norm-27 ( 15K) [CUDA0 ] node #980 ( RMS_NORM): norm-27 ( 16K) [CUDA0 ]: Qcur-27 (reshaped) ( 16K) [CUDA0 ] node #981 ( MUL): Qcur_normed-27 ( 16K) [CUDA0 ]: norm-27 ( 16K) [CUDA0 ] blk.27.attn_q_norm.w ( 1K) [CUDA0 ] node #982 ( ROPE): Qcur-27 ( 16K) [CUDA0 ]: Qcur_normed-27 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #983 ( MUL_MAT): Kcur-27 ( 8K) [CUDA0 ]: blk.27.attn_k.weight ( 4M) [CUDA0 ] attn_norm-27 ( 15K) [CUDA0 ] node #985 ( RMS_NORM): norm-27 ( 8K) [CUDA0 ]: Kcur-27 (reshaped) ( 8K) [CUDA0 ] node #986 ( MUL): Kcur_normed-27 ( 8K) [CUDA0 ]: norm-27 ( 8K) [CUDA0 ] blk.27.attn_k_norm.w ( 1K) [CUDA0 ] node #987 ( ROPE): Kcur-27 ( 8K) [CUDA0 ]: Kcur_normed-27 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #988 ( MUL_MAT): Vcur-27 ( 8K) [CUDA0 ]: blk.27.attn_v.weight ( 4M) [CUDA0 ] attn_norm-27 ( 15K) [CUDA0 ] node #990 ( CPY): k_cache_view-27 (cop ( 2K) [CUDA0 ]: Kcur-27 ( 8K) [CUDA0 ] k_cache_view-27 ( 2K) [CUDA0 ] node #992 ( CPY): v_cache_view-27 (cop ( 2K) [CUDA0 ]: Vcur-27 ( 8K) [CUDA0 ] v_cache_view-27 ( 2K) [CUDA0 ]
SPLIT #56: CPU # 3 inputs: [q-27 ( 16K)] [k-27 ( 544K)] [v-27 ( 544K)]
node #996 (FLASH_ATTN): node_996 ( 16K) [ CPU ]: CPU#q-27#0 ( 16K) [ NULL ] CPU#k-27#0 ( 544K) [ NULL ] CPU#v-27#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #57: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #998 ( MUL_MAT): kqv_out-27 ( 15K) [CUDA0 ]: blk.27.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #999 ( RMS_NORM): norm-27 ( 15K) [CUDA0 ]: kqv_out-27 ( 15K) [CUDA0 ] node #1000 ( MUL): attn_post_norm-27 ( 15K) [CUDA0 ]: norm-27 ( 15K) [CUDA0 ] blk.27.post_attentio ( 15K) [CUDA0 ] node #1001 ( ADD): sa_out-27 ( 15K) [CUDA0 ]: attn_post_norm-27 ( 15K) [CUDA0 ] l_out-26 ( 15K) [CUDA0 ] node #1002 ( RMS_NORM): norm-27 ( 15K) [CUDA0 ]: sa_out-27 ( 15K) [CUDA0 ] node #1003 ( MUL): ffn_norm-27 ( 15K) [CUDA0 ]: norm-27 ( 15K) [CUDA0 ] blk.27.ffn_norm.weig ( 15K) [CUDA0 ] node #1004 ( MUL_MAT): ffn_gate-27 ( 60K) [CUDA0 ]: blk.27.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-27 ( 15K) [CUDA0 ] node #1005 ( UNARY): ffn_gelu-27 ( 60K) [CUDA0 ]: ffn_gate-27 ( 60K) [CUDA0 ] node #1006 ( MUL_MAT): ffn_up-27 ( 60K) [CUDA0 ]: blk.27.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-27 ( 15K) [CUDA0 ] node #1007 ( MUL): ffn_gate_par-27 ( 60K) [CUDA0 ]: ffn_gelu-27 ( 60K) [CUDA0 ] ffn_up-27 ( 60K) [CUDA0 ] node #1008 ( MUL_MAT): ffn_out-27 ( 15K) [CUDA0 ]: blk.27.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-27 ( 60K) [CUDA0 ] node #1009 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-27 ( 15K) [CUDA0 ] node #1010 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.27.post_ffw_norm ( 15K) [CUDA0 ] node #1011 ( ADD): l_out-27 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-27 ( 15K) [CUDA0 ] node #1012 ( RMS_NORM): norm-28 ( 15K) [CUDA0 ]: l_out-27 ( 15K) [CUDA0 ] node #1013 ( MUL): attn_norm-28 ( 15K) [CUDA0 ]: norm-28 ( 15K) [CUDA0 ] blk.28.attn_norm.wei ( 15K) [CUDA0 ] node #1014 ( MUL_MAT): Qcur-28 ( 16K) [CUDA0 ]: blk.28.attn_q.weight ( 8M) [CUDA0 ] attn_norm-28 ( 15K) [CUDA0 ] node #1016 ( RMS_NORM): norm-28 ( 16K) [CUDA0 ]: Qcur-28 (reshaped) ( 16K) [CUDA0 ] node #1017 ( MUL): Qcur_normed-28 ( 16K) [CUDA0 ]: norm-28 ( 16K) [CUDA0 ] blk.28.attn_q_norm.w ( 1K) [CUDA0 ] node #1018 ( ROPE): Qcur-28 ( 16K) [CUDA0 ]: Qcur_normed-28 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1019 ( MUL_MAT): Kcur-28 ( 8K) [CUDA0 ]: blk.28.attn_k.weight ( 4M) [CUDA0 ] attn_norm-28 ( 15K) [CUDA0 ] node #1021 ( RMS_NORM): norm-28 ( 8K) [CUDA0 ]: Kcur-28 (reshaped) ( 8K) [CUDA0 ] node #1022 ( MUL): Kcur_normed-28 ( 8K) [CUDA0 ]: norm-28 ( 8K) [CUDA0 ] blk.28.attn_k_norm.w ( 1K) [CUDA0 ] node #1023 ( ROPE): Kcur-28 ( 8K) [CUDA0 ]: Kcur_normed-28 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1024 ( MUL_MAT): Vcur-28 ( 8K) [CUDA0 ]: blk.28.attn_v.weight ( 4M) [CUDA0 ] attn_norm-28 ( 15K) [CUDA0 ] node #1026 ( CPY): k_cache_view-28 (cop ( 2K) [CUDA0 ]: Kcur-28 ( 8K) [CUDA0 ] k_cache_view-28 ( 2K) [CUDA0 ] node #1028 ( CPY): v_cache_view-28 (cop ( 2K) [CUDA0 ]: Vcur-28 ( 8K) [CUDA0 ] v_cache_view-28 ( 2K) [CUDA0 ]
SPLIT #58: CPU # 3 inputs: [q-28 ( 16K)] [k-28 ( 544K)] [v-28 ( 544K)]
node #1032 (FLASH_ATTN): node_1032 ( 16K) [ CPU ]: CPU#q-28#0 ( 16K) [ NULL ] CPU#k-28#0 ( 544K) [ NULL ] CPU#v-28#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #59: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1034 ( MUL_MAT): kqv_out-28 ( 15K) [CUDA0 ]: blk.28.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1035 ( RMS_NORM): norm-28 ( 15K) [CUDA0 ]: kqv_out-28 ( 15K) [CUDA0 ] node #1036 ( MUL): attn_post_norm-28 ( 15K) [CUDA0 ]: norm-28 ( 15K) [CUDA0 ] blk.28.post_attentio ( 15K) [CUDA0 ] node #1037 ( ADD): sa_out-28 ( 15K) [CUDA0 ]: attn_post_norm-28 ( 15K) [CUDA0 ] l_out-27 ( 15K) [CUDA0 ] node #1038 ( RMS_NORM): norm-28 ( 15K) [CUDA0 ]: sa_out-28 ( 15K) [CUDA0 ] node #1039 ( MUL): ffn_norm-28 ( 15K) [CUDA0 ]: norm-28 ( 15K) [CUDA0 ] blk.28.ffn_norm.weig ( 15K) [CUDA0 ] node #1040 ( MUL_MAT): ffn_gate-28 ( 60K) [CUDA0 ]: blk.28.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-28 ( 15K) [CUDA0 ] node #1041 ( UNARY): ffn_gelu-28 ( 60K) [CUDA0 ]: ffn_gate-28 ( 60K) [CUDA0 ] node #1042 ( MUL_MAT): ffn_up-28 ( 60K) [CUDA0 ]: blk.28.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-28 ( 15K) [CUDA0 ] node #1043 ( MUL): ffn_gate_par-28 ( 60K) [CUDA0 ]: ffn_gelu-28 ( 60K) [CUDA0 ] ffn_up-28 ( 60K) [CUDA0 ] node #1044 ( MUL_MAT): ffn_out-28 ( 15K) [CUDA0 ]: blk.28.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-28 ( 60K) [CUDA0 ] node #1045 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-28 ( 15K) [CUDA0 ] node #1046 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.28.post_ffw_norm ( 15K) [CUDA0 ] node #1047 ( ADD): l_out-28 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-28 ( 15K) [CUDA0 ] node #1048 ( RMS_NORM): norm-29 ( 15K) [CUDA0 ]: l_out-28 ( 15K) [CUDA0 ] node #1049 ( MUL): attn_norm-29 ( 15K) [CUDA0 ]: norm-29 ( 15K) [CUDA0 ] blk.29.attn_norm.wei ( 15K) [CUDA0 ] node #1050 ( MUL_MAT): Qcur-29 ( 16K) [CUDA0 ]: blk.29.attn_q.weight ( 8M) [CUDA0 ] attn_norm-29 ( 15K) [CUDA0 ] node #1052 ( RMS_NORM): norm-29 ( 16K) [CUDA0 ]: Qcur-29 (reshaped) ( 16K) [CUDA0 ] node #1053 ( MUL): Qcur_normed-29 ( 16K) [CUDA0 ]: norm-29 ( 16K) [CUDA0 ] blk.29.attn_q_norm.w ( 1K) [CUDA0 ] node #1054 ( ROPE): Qcur-29 ( 16K) [CUDA0 ]: Qcur_normed-29 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1055 ( MUL_MAT): Kcur-29 ( 8K) [CUDA0 ]: blk.29.attn_k.weight ( 4M) [CUDA0 ] attn_norm-29 ( 15K) [CUDA0 ] node #1057 ( RMS_NORM): norm-29 ( 8K) [CUDA0 ]: Kcur-29 (reshaped) ( 8K) [CUDA0 ] node #1058 ( MUL): Kcur_normed-29 ( 8K) [CUDA0 ]: norm-29 ( 8K) [CUDA0 ] blk.29.attn_k_norm.w ( 1K) [CUDA0 ] node #1059 ( ROPE): Kcur-29 ( 8K) [CUDA0 ]: Kcur_normed-29 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1060 ( MUL_MAT): Vcur-29 ( 8K) [CUDA0 ]: blk.29.attn_v.weight ( 6M) [CUDA0 ] attn_norm-29 ( 15K) [CUDA0 ] node #1062 ( CPY): k_cache_view-29 (cop ( 2K) [CUDA0 ]: Kcur-29 ( 8K) [CUDA0 ] k_cache_view-29 ( 2K) [CUDA0 ] node #1064 ( CPY): v_cache_view-29 (cop ( 2K) [CUDA0 ]: Vcur-29 ( 8K) [CUDA0 ] v_cache_view-29 ( 2K) [CUDA0 ]
SPLIT #60: CPU # 3 inputs: [q-29 ( 16K)] [k-29 ( 544K)] [v-29 ( 544K)]
node #1068 (FLASH_ATTN): node_1068 ( 16K) [ CPU ]: CPU#q-29#0 ( 16K) [ NULL ] CPU#k-29#0 ( 544K) [ NULL ] CPU#v-29#0 ( 544K) [ NULL ] CPU#KQ_mask (copy)#0 ( 32K) [ NULL ]
SPLIT #61: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1070 ( MUL_MAT): kqv_out-29 ( 15K) [CUDA0 ]: blk.29.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1071 ( RMS_NORM): norm-29 ( 15K) [CUDA0 ]: kqv_out-29 ( 15K) [CUDA0 ] node #1072 ( MUL): attn_post_norm-29 ( 15K) [CUDA0 ]: norm-29 ( 15K) [CUDA0 ] blk.29.post_attentio ( 15K) [CUDA0 ] node #1073 ( ADD): sa_out-29 ( 15K) [CUDA0 ]: attn_post_norm-29 ( 15K) [CUDA0 ] l_out-28 ( 15K) [CUDA0 ] node #1074 ( RMS_NORM): norm-29 ( 15K) [CUDA0 ]: sa_out-29 ( 15K) [CUDA0 ] node #1075 ( MUL): ffn_norm-29 ( 15K) [CUDA0 ]: norm-29 ( 15K) [CUDA0 ] blk.29.ffn_norm.weig ( 15K) [CUDA0 ] node #1076 ( MUL_MAT): ffn_gate-29 ( 60K) [CUDA0 ]: blk.29.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-29 ( 15K) [CUDA0 ] node #1077 ( UNARY): ffn_gelu-29 ( 60K) [CUDA0 ]: ffn_gate-29 ( 60K) [CUDA0 ] node #1078 ( MUL_MAT): ffn_up-29 ( 60K) [CUDA0 ]: blk.29.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-29 ( 15K) [CUDA0 ] node #1079 ( MUL): ffn_gate_par-29 ( 60K) [CUDA0 ]: ffn_gelu-29 ( 60K) [CUDA0 ] ffn_up-29 ( 60K) [CUDA0 ] node #1080 ( MUL_MAT): ffn_out-29 ( 15K) [CUDA0 ]: blk.29.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-29 ( 60K) [CUDA0 ] node #1081 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-29 ( 15K) [CUDA0 ] node #1082 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.29.post_ffw_norm ( 15K) [CUDA0 ] node #1083 ( ADD): l_out-29 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-29 ( 15K) [CUDA0 ] node #1084 ( RMS_NORM): norm-30 ( 15K) [CUDA0 ]: l_out-29 ( 15K) [CUDA0 ] node #1085 ( MUL): attn_norm-30 ( 15K) [CUDA0 ]: norm-30 ( 15K) [CUDA0 ] blk.30.attn_norm.wei ( 15K) [CUDA0 ] node #1086 ( MUL_MAT): Qcur-30 ( 16K) [CUDA0 ]: blk.30.attn_q.weight ( 8M) [CUDA0 ] attn_norm-30 ( 15K) [CUDA0 ] node #1088 ( RMS_NORM): norm-30 ( 16K) [CUDA0 ]: Qcur-30 (reshaped) ( 16K) [CUDA0 ] node #1089 ( MUL): Qcur_normed-30 ( 16K) [CUDA0 ]: norm-30 ( 16K) [CUDA0 ] blk.30.attn_q_norm.w ( 1K) [CUDA0 ] node #1090 ( ROPE): Qcur-30 ( 16K) [CUDA0 ]: Qcur_normed-30 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1091 ( MUL_MAT): Kcur-30 ( 8K) [CUDA0 ]: blk.30.attn_k.weight ( 4M) [CUDA0 ] attn_norm-30 ( 15K) [CUDA0 ] node #1093 ( RMS_NORM): norm-30 ( 8K) [CUDA0 ]: Kcur-30 (reshaped) ( 8K) [CUDA0 ] node #1094 ( MUL): Kcur_normed-30 ( 8K) [CUDA0 ]: norm-30 ( 8K) [CUDA0 ] blk.30.attn_k_norm.w ( 1K) [CUDA0 ] node #1095 ( ROPE): Kcur-30 ( 8K) [CUDA0 ]: Kcur_normed-30 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1096 ( MUL_MAT): Vcur-30 ( 8K) [CUDA0 ]: blk.30.attn_v.weight ( 4M) [CUDA0 ] attn_norm-30 ( 15K) [CUDA0 ] node #1098 ( CPY): k_cache_view-30 (cop ( 2K) [CUDA0 ]: Kcur-30 ( 8K) [CUDA0 ] k_cache_view-30 ( 2K) [CUDA0 ] node #1100 ( CPY): v_cache_view-30 (cop ( 2K) [CUDA0 ]: Vcur-30 ( 8K) [CUDA0 ] v_cache_view-30 ( 2K) [CUDA0 ]
SPLIT #62: CPU # 3 inputs: [q-30 ( 16K)] [k-30 ( 544K)] [v-30 ( 544K)]
node #1104 (FLASH_ATTN): node_1104 ( 16K) [ CPU ]: CPU#q-30#0 ( 16K) [ NULL ] CPU#k-30#0 ( 544K) [ NULL ] CPU#v-30#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #63: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1106 ( MUL_MAT): kqv_out-30 ( 15K) [CUDA0 ]: blk.30.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1107 ( RMS_NORM): norm-30 ( 15K) [CUDA0 ]: kqv_out-30 ( 15K) [CUDA0 ] node #1108 ( MUL): attn_post_norm-30 ( 15K) [CUDA0 ]: norm-30 ( 15K) [CUDA0 ] blk.30.post_attentio ( 15K) [CUDA0 ] node #1109 ( ADD): sa_out-30 ( 15K) [CUDA0 ]: attn_post_norm-30 ( 15K) [CUDA0 ] l_out-29 ( 15K) [CUDA0 ] node #1110 ( RMS_NORM): norm-30 ( 15K) [CUDA0 ]: sa_out-30 ( 15K) [CUDA0 ] node #1111 ( MUL): ffn_norm-30 ( 15K) [CUDA0 ]: norm-30 ( 15K) [CUDA0 ] blk.30.ffn_norm.weig ( 15K) [CUDA0 ] node #1112 ( MUL_MAT): ffn_gate-30 ( 60K) [CUDA0 ]: blk.30.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-30 ( 15K) [CUDA0 ] node #1113 ( UNARY): ffn_gelu-30 ( 60K) [CUDA0 ]: ffn_gate-30 ( 60K) [CUDA0 ] node #1114 ( MUL_MAT): ffn_up-30 ( 60K) [CUDA0 ]: blk.30.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-30 ( 15K) [CUDA0 ] node #1115 ( MUL): ffn_gate_par-30 ( 60K) [CUDA0 ]: ffn_gelu-30 ( 60K) [CUDA0 ] ffn_up-30 ( 60K) [CUDA0 ] node #1116 ( MUL_MAT): ffn_out-30 ( 15K) [CUDA0 ]: blk.30.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-30 ( 60K) [CUDA0 ] node #1117 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-30 ( 15K) [CUDA0 ] node #1118 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.30.post_ffw_norm ( 15K) [CUDA0 ] node #1119 ( ADD): l_out-30 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-30 ( 15K) [CUDA0 ] node #1120 ( RMS_NORM): norm-31 ( 15K) [CUDA0 ]: l_out-30 ( 15K) [CUDA0 ] node #1121 ( MUL): attn_norm-31 ( 15K) [CUDA0 ]: norm-31 ( 15K) [CUDA0 ] blk.31.attn_norm.wei ( 15K) [CUDA0 ] node #1122 ( MUL_MAT): Qcur-31 ( 16K) [CUDA0 ]: blk.31.attn_q.weight ( 8M) [CUDA0 ] attn_norm-31 ( 15K) [CUDA0 ] node #1124 ( RMS_NORM): norm-31 ( 16K) [CUDA0 ]: Qcur-31 (reshaped) ( 16K) [CUDA0 ] node #1125 ( MUL): Qcur_normed-31 ( 16K) [CUDA0 ]: norm-31 ( 16K) [CUDA0 ] blk.31.attn_q_norm.w ( 1K) [CUDA0 ] node #1126 ( ROPE): Qcur-31 ( 16K) [CUDA0 ]: Qcur_normed-31 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1127 ( MUL_MAT): Kcur-31 ( 8K) [CUDA0 ]: blk.31.attn_k.weight ( 4M) [CUDA0 ] attn_norm-31 ( 15K) [CUDA0 ] node #1129 ( RMS_NORM): norm-31 ( 8K) [CUDA0 ]: Kcur-31 (reshaped) ( 8K) [CUDA0 ] node #1130 ( MUL): Kcur_normed-31 ( 8K) [CUDA0 ]: norm-31 ( 8K) [CUDA0 ] blk.31.attn_k_norm.w ( 1K) [CUDA0 ] node #1131 ( ROPE): Kcur-31 ( 8K) [CUDA0 ]: Kcur_normed-31 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1132 ( MUL_MAT): Vcur-31 ( 8K) [CUDA0 ]: blk.31.attn_v.weight ( 4M) [CUDA0 ] attn_norm-31 ( 15K) [CUDA0 ] node #1134 ( CPY): k_cache_view-31 (cop ( 2K) [CUDA0 ]: Kcur-31 ( 8K) [CUDA0 ] k_cache_view-31 ( 2K) [CUDA0 ] node #1136 ( CPY): v_cache_view-31 (cop ( 2K) [CUDA0 ]: Vcur-31 ( 8K) [CUDA0 ] v_cache_view-31 ( 2K) [CUDA0 ]
SPLIT #64: CPU # 3 inputs: [q-31 ( 16K)] [k-31 ( 544K)] [v-31 ( 544K)]
node #1140 (FLASH_ATTN): node_1140 ( 16K) [ CPU ]: CPU#q-31#0 ( 16K) [ NULL ] CPU#k-31#0 ( 544K) [ NULL ] CPU#v-31#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #65: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1142 ( MUL_MAT): kqv_out-31 ( 15K) [CUDA0 ]: blk.31.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1143 ( RMS_NORM): norm-31 ( 15K) [CUDA0 ]: kqv_out-31 ( 15K) [CUDA0 ] node #1144 ( MUL): attn_post_norm-31 ( 15K) [CUDA0 ]: norm-31 ( 15K) [CUDA0 ] blk.31.post_attentio ( 15K) [CUDA0 ] node #1145 ( ADD): sa_out-31 ( 15K) [CUDA0 ]: attn_post_norm-31 ( 15K) [CUDA0 ] l_out-30 ( 15K) [CUDA0 ] node #1146 ( RMS_NORM): norm-31 ( 15K) [CUDA0 ]: sa_out-31 ( 15K) [CUDA0 ] node #1147 ( MUL): ffn_norm-31 ( 15K) [CUDA0 ]: norm-31 ( 15K) [CUDA0 ] blk.31.ffn_norm.weig ( 15K) [CUDA0 ] node #1148 ( MUL_MAT): ffn_gate-31 ( 60K) [CUDA0 ]: blk.31.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-31 ( 15K) [CUDA0 ] node #1149 ( UNARY): ffn_gelu-31 ( 60K) [CUDA0 ]: ffn_gate-31 ( 60K) [CUDA0 ] node #1150 ( MUL_MAT): ffn_up-31 ( 60K) [CUDA0 ]: blk.31.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-31 ( 15K) [CUDA0 ] node #1151 ( MUL): ffn_gate_par-31 ( 60K) [CUDA0 ]: ffn_gelu-31 ( 60K) [CUDA0 ] ffn_up-31 ( 60K) [CUDA0 ] node #1152 ( MUL_MAT): ffn_out-31 ( 15K) [CUDA0 ]: blk.31.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-31 ( 60K) [CUDA0 ] node #1153 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-31 ( 15K) [CUDA0 ] node #1154 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.31.post_ffw_norm ( 15K) [CUDA0 ] node #1155 ( ADD): l_out-31 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-31 ( 15K) [CUDA0 ] node #1156 ( RMS_NORM): norm-32 ( 15K) [CUDA0 ]: l_out-31 ( 15K) [CUDA0 ] node #1157 ( MUL): attn_norm-32 ( 15K) [CUDA0 ]: norm-32 ( 15K) [CUDA0 ] blk.32.attn_norm.wei ( 15K) [CUDA0 ] node #1158 ( MUL_MAT): Qcur-32 ( 16K) [CUDA0 ]: blk.32.attn_q.weight ( 8M) [CUDA0 ] attn_norm-32 ( 15K) [CUDA0 ] node #1160 ( RMS_NORM): norm-32 ( 16K) [CUDA0 ]: Qcur-32 (reshaped) ( 16K) [CUDA0 ] node #1161 ( MUL): Qcur_normed-32 ( 16K) [CUDA0 ]: norm-32 ( 16K) [CUDA0 ] blk.32.attn_q_norm.w ( 1K) [CUDA0 ] node #1162 ( ROPE): Qcur-32 ( 16K) [CUDA0 ]: Qcur_normed-32 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1163 ( MUL_MAT): Kcur-32 ( 8K) [CUDA0 ]: blk.32.attn_k.weight ( 4M) [CUDA0 ] attn_norm-32 ( 15K) [CUDA0 ] node #1165 ( RMS_NORM): norm-32 ( 8K) [CUDA0 ]: Kcur-32 (reshaped) ( 8K) [CUDA0 ] node #1166 ( MUL): Kcur_normed-32 ( 8K) [CUDA0 ]: norm-32 ( 8K) [CUDA0 ] blk.32.attn_k_norm.w ( 1K) [CUDA0 ] node #1167 ( ROPE): Kcur-32 ( 8K) [CUDA0 ]: Kcur_normed-32 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1168 ( MUL_MAT): Vcur-32 ( 8K) [CUDA0 ]: blk.32.attn_v.weight ( 6M) [CUDA0 ] attn_norm-32 ( 15K) [CUDA0 ] node #1170 ( CPY): k_cache_view-32 (cop ( 2K) [CUDA0 ]: Kcur-32 ( 8K) [CUDA0 ] k_cache_view-32 ( 2K) [CUDA0 ] node #1172 ( CPY): v_cache_view-32 (cop ( 2K) [CUDA0 ]: Vcur-32 ( 8K) [CUDA0 ] v_cache_view-32 ( 2K) [CUDA0 ]
SPLIT #66: CPU # 3 inputs: [q-32 ( 16K)] [k-32 ( 544K)] [v-32 ( 544K)]
node #1176 (FLASH_ATTN): node_1176 ( 16K) [ CPU ]: CPU#q-32#0 ( 16K) [ NULL ] CPU#k-32#0 ( 544K) [ NULL ] CPU#v-32#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #67: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1178 ( MUL_MAT): kqv_out-32 ( 15K) [CUDA0 ]: blk.32.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1179 ( RMS_NORM): norm-32 ( 15K) [CUDA0 ]: kqv_out-32 ( 15K) [CUDA0 ] node #1180 ( MUL): attn_post_norm-32 ( 15K) [CUDA0 ]: norm-32 ( 15K) [CUDA0 ] blk.32.post_attentio ( 15K) [CUDA0 ] node #1181 ( ADD): sa_out-32 ( 15K) [CUDA0 ]: attn_post_norm-32 ( 15K) [CUDA0 ] l_out-31 ( 15K) [CUDA0 ] node #1182 ( RMS_NORM): norm-32 ( 15K) [CUDA0 ]: sa_out-32 ( 15K) [CUDA0 ] node #1183 ( MUL): ffn_norm-32 ( 15K) [CUDA0 ]: norm-32 ( 15K) [CUDA0 ] blk.32.ffn_norm.weig ( 15K) [CUDA0 ] node #1184 ( MUL_MAT): ffn_gate-32 ( 60K) [CUDA0 ]: blk.32.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-32 ( 15K) [CUDA0 ] node #1185 ( UNARY): ffn_gelu-32 ( 60K) [CUDA0 ]: ffn_gate-32 ( 60K) [CUDA0 ] node #1186 ( MUL_MAT): ffn_up-32 ( 60K) [CUDA0 ]: blk.32.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-32 ( 15K) [CUDA0 ] node #1187 ( MUL): ffn_gate_par-32 ( 60K) [CUDA0 ]: ffn_gelu-32 ( 60K) [CUDA0 ] ffn_up-32 ( 60K) [CUDA0 ] node #1188 ( MUL_MAT): ffn_out-32 ( 15K) [CUDA0 ]: blk.32.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-32 ( 60K) [CUDA0 ] node #1189 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-32 ( 15K) [CUDA0 ] node #1190 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.32.post_ffw_norm ( 15K) [CUDA0 ] node #1191 ( ADD): l_out-32 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-32 ( 15K) [CUDA0 ] node #1192 ( RMS_NORM): norm-33 ( 15K) [CUDA0 ]: l_out-32 ( 15K) [CUDA0 ] node #1193 ( MUL): attn_norm-33 ( 15K) [CUDA0 ]: norm-33 ( 15K) [CUDA0 ] blk.33.attn_norm.wei ( 15K) [CUDA0 ] node #1194 ( MUL_MAT): Qcur-33 ( 16K) [CUDA0 ]: blk.33.attn_q.weight ( 8M) [CUDA0 ] attn_norm-33 ( 15K) [CUDA0 ] node #1196 ( RMS_NORM): norm-33 ( 16K) [CUDA0 ]: Qcur-33 (reshaped) ( 16K) [CUDA0 ] node #1197 ( MUL): Qcur_normed-33 ( 16K) [CUDA0 ]: norm-33 ( 16K) [CUDA0 ] blk.33.attn_q_norm.w ( 1K) [CUDA0 ] node #1198 ( ROPE): Qcur-33 ( 16K) [CUDA0 ]: Qcur_normed-33 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1199 ( MUL_MAT): Kcur-33 ( 8K) [CUDA0 ]: blk.33.attn_k.weight ( 4M) [CUDA0 ] attn_norm-33 ( 15K) [CUDA0 ] node #1201 ( RMS_NORM): norm-33 ( 8K) [CUDA0 ]: Kcur-33 (reshaped) ( 8K) [CUDA0 ] node #1202 ( MUL): Kcur_normed-33 ( 8K) [CUDA0 ]: norm-33 ( 8K) [CUDA0 ] blk.33.attn_k_norm.w ( 1K) [CUDA0 ] node #1203 ( ROPE): Kcur-33 ( 8K) [CUDA0 ]: Kcur_normed-33 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1204 ( MUL_MAT): Vcur-33 ( 8K) [CUDA0 ]: blk.33.attn_v.weight ( 4M) [CUDA0 ] attn_norm-33 ( 15K) [CUDA0 ] node #1206 ( CPY): k_cache_view-33 (cop ( 2K) [CUDA0 ]: Kcur-33 ( 8K) [CUDA0 ] k_cache_view-33 ( 2K) [CUDA0 ] node #1208 ( CPY): v_cache_view-33 (cop ( 2K) [CUDA0 ]: Vcur-33 ( 8K) [CUDA0 ] v_cache_view-33 ( 2K) [CUDA0 ]
SPLIT #68: CPU # 3 inputs: [q-33 ( 16K)] [k-33 ( 544K)] [v-33 ( 544K)]
node #1212 (FLASH_ATTN): node_1212 ( 16K) [ CPU ]: CPU#q-33#0 ( 16K) [ NULL ] CPU#k-33#0 ( 544K) [ NULL ] CPU#v-33#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #69: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1214 ( MUL_MAT): kqv_out-33 ( 15K) [CUDA0 ]: blk.33.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1215 ( RMS_NORM): norm-33 ( 15K) [CUDA0 ]: kqv_out-33 ( 15K) [CUDA0 ] node #1216 ( MUL): attn_post_norm-33 ( 15K) [CUDA0 ]: norm-33 ( 15K) [CUDA0 ] blk.33.post_attentio ( 15K) [CUDA0 ] node #1217 ( ADD): sa_out-33 ( 15K) [CUDA0 ]: attn_post_norm-33 ( 15K) [CUDA0 ] l_out-32 ( 15K) [CUDA0 ] node #1218 ( RMS_NORM): norm-33 ( 15K) [CUDA0 ]: sa_out-33 ( 15K) [CUDA0 ] node #1219 ( MUL): ffn_norm-33 ( 15K) [CUDA0 ]: norm-33 ( 15K) [CUDA0 ] blk.33.ffn_norm.weig ( 15K) [CUDA0 ] node #1220 ( MUL_MAT): ffn_gate-33 ( 60K) [CUDA0 ]: blk.33.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-33 ( 15K) [CUDA0 ] node #1221 ( UNARY): ffn_gelu-33 ( 60K) [CUDA0 ]: ffn_gate-33 ( 60K) [CUDA0 ] node #1222 ( MUL_MAT): ffn_up-33 ( 60K) [CUDA0 ]: blk.33.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-33 ( 15K) [CUDA0 ] node #1223 ( MUL): ffn_gate_par-33 ( 60K) [CUDA0 ]: ffn_gelu-33 ( 60K) [CUDA0 ] ffn_up-33 ( 60K) [CUDA0 ] node #1224 ( MUL_MAT): ffn_out-33 ( 15K) [CUDA0 ]: blk.33.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-33 ( 60K) [CUDA0 ] node #1225 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-33 ( 15K) [CUDA0 ] node #1226 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.33.post_ffw_norm ( 15K) [CUDA0 ] node #1227 ( ADD): l_out-33 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-33 ( 15K) [CUDA0 ] node #1228 ( RMS_NORM): norm-34 ( 15K) [CUDA0 ]: l_out-33 ( 15K) [CUDA0 ] node #1229 ( MUL): attn_norm-34 ( 15K) [CUDA0 ]: norm-34 ( 15K) [CUDA0 ] blk.34.attn_norm.wei ( 15K) [CUDA0 ] node #1230 ( MUL_MAT): Qcur-34 ( 16K) [CUDA0 ]: blk.34.attn_q.weight ( 8M) [CUDA0 ] attn_norm-34 ( 15K) [CUDA0 ] node #1232 ( RMS_NORM): norm-34 ( 16K) [CUDA0 ]: Qcur-34 (reshaped) ( 16K) [CUDA0 ] node #1233 ( MUL): Qcur_normed-34 ( 16K) [CUDA0 ]: norm-34 ( 16K) [CUDA0 ] blk.34.attn_q_norm.w ( 1K) [CUDA0 ] node #1234 ( ROPE): Qcur-34 ( 16K) [CUDA0 ]: Qcur_normed-34 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1235 ( MUL_MAT): Kcur-34 ( 8K) [CUDA0 ]: blk.34.attn_k.weight ( 4M) [CUDA0 ] attn_norm-34 ( 15K) [CUDA0 ] node #1237 ( RMS_NORM): norm-34 ( 8K) [CUDA0 ]: Kcur-34 (reshaped) ( 8K) [CUDA0 ] node #1238 ( MUL): Kcur_normed-34 ( 8K) [CUDA0 ]: norm-34 ( 8K) [CUDA0 ] blk.34.attn_k_norm.w ( 1K) [CUDA0 ] node #1239 ( ROPE): Kcur-34 ( 8K) [CUDA0 ]: Kcur_normed-34 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1240 ( MUL_MAT): Vcur-34 ( 8K) [CUDA0 ]: blk.34.attn_v.weight ( 4M) [CUDA0 ] attn_norm-34 ( 15K) [CUDA0 ] node #1242 ( CPY): k_cache_view-34 (cop ( 2K) [CUDA0 ]: Kcur-34 ( 8K) [CUDA0 ] k_cache_view-34 ( 2K) [CUDA0 ] node #1244 ( CPY): v_cache_view-34 (cop ( 2K) [CUDA0 ]: Vcur-34 ( 8K) [CUDA0 ] v_cache_view-34 ( 2K) [CUDA0 ]
SPLIT #70: CPU # 3 inputs: [q-34 ( 16K)] [k-34 ( 544K)] [v-34 ( 544K)]
node #1248 (FLASH_ATTN): node_1248 ( 16K) [ CPU ]: CPU#q-34#0 ( 16K) [ NULL ] CPU#k-34#0 ( 544K) [ NULL ] CPU#v-34#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #71: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1250 ( MUL_MAT): kqv_out-34 ( 15K) [CUDA0 ]: blk.34.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1251 ( RMS_NORM): norm-34 ( 15K) [CUDA0 ]: kqv_out-34 ( 15K) [CUDA0 ] node #1252 ( MUL): attn_post_norm-34 ( 15K) [CUDA0 ]: norm-34 ( 15K) [CUDA0 ] blk.34.post_attentio ( 15K) [CUDA0 ] node #1253 ( ADD): sa_out-34 ( 15K) [CUDA0 ]: attn_post_norm-34 ( 15K) [CUDA0 ] l_out-33 ( 15K) [CUDA0 ] node #1254 ( RMS_NORM): norm-34 ( 15K) [CUDA0 ]: sa_out-34 ( 15K) [CUDA0 ] node #1255 ( MUL): ffn_norm-34 ( 15K) [CUDA0 ]: norm-34 ( 15K) [CUDA0 ] blk.34.ffn_norm.weig ( 15K) [CUDA0 ] node #1256 ( MUL_MAT): ffn_gate-34 ( 60K) [CUDA0 ]: blk.34.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-34 ( 15K) [CUDA0 ] node #1257 ( UNARY): ffn_gelu-34 ( 60K) [CUDA0 ]: ffn_gate-34 ( 60K) [CUDA0 ] node #1258 ( MUL_MAT): ffn_up-34 ( 60K) [CUDA0 ]: blk.34.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-34 ( 15K) [CUDA0 ] node #1259 ( MUL): ffn_gate_par-34 ( 60K) [CUDA0 ]: ffn_gelu-34 ( 60K) [CUDA0 ] ffn_up-34 ( 60K) [CUDA0 ] node #1260 ( MUL_MAT): ffn_out-34 ( 15K) [CUDA0 ]: blk.34.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-34 ( 60K) [CUDA0 ] node #1261 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-34 ( 15K) [CUDA0 ] node #1262 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.34.post_ffw_norm ( 15K) [CUDA0 ] node #1263 ( ADD): l_out-34 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-34 ( 15K) [CUDA0 ] node #1264 ( RMS_NORM): norm-35 ( 15K) [CUDA0 ]: l_out-34 ( 15K) [CUDA0 ] node #1265 ( MUL): attn_norm-35 ( 15K) [CUDA0 ]: norm-35 ( 15K) [CUDA0 ] blk.35.attn_norm.wei ( 15K) [CUDA0 ] node #1266 ( MUL_MAT): Qcur-35 ( 16K) [CUDA0 ]: blk.35.attn_q.weight ( 8M) [CUDA0 ] attn_norm-35 ( 15K) [CUDA0 ] node #1268 ( RMS_NORM): norm-35 ( 16K) [CUDA0 ]: Qcur-35 (reshaped) ( 16K) [CUDA0 ] node #1269 ( MUL): Qcur_normed-35 ( 16K) [CUDA0 ]: norm-35 ( 16K) [CUDA0 ] blk.35.attn_q_norm.w ( 1K) [CUDA0 ] node #1270 ( ROPE): Qcur-35 ( 16K) [CUDA0 ]: Qcur_normed-35 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1271 ( MUL_MAT): Kcur-35 ( 8K) [CUDA0 ]: blk.35.attn_k.weight ( 4M) [CUDA0 ] attn_norm-35 ( 15K) [CUDA0 ] node #1273 ( RMS_NORM): norm-35 ( 8K) [CUDA0 ]: Kcur-35 (reshaped) ( 8K) [CUDA0 ] node #1274 ( MUL): Kcur_normed-35 ( 8K) [CUDA0 ]: norm-35 ( 8K) [CUDA0 ] blk.35.attn_k_norm.w ( 1K) [CUDA0 ] node #1275 ( ROPE): Kcur-35 ( 8K) [CUDA0 ]: Kcur_normed-35 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1276 ( MUL_MAT): Vcur-35 ( 8K) [CUDA0 ]: blk.35.attn_v.weight ( 6M) [CUDA0 ] attn_norm-35 ( 15K) [CUDA0 ] node #1278 ( CPY): k_cache_view-35 (cop ( 2K) [CUDA0 ]: Kcur-35 ( 8K) [CUDA0 ] k_cache_view-35 ( 2K) [CUDA0 ] node #1280 ( CPY): v_cache_view-35 (cop ( 2K) [CUDA0 ]: Vcur-35 ( 8K) [CUDA0 ] v_cache_view-35 ( 2K) [CUDA0 ]
SPLIT #72: CPU # 3 inputs: [q-35 ( 16K)] [k-35 ( 544K)] [v-35 ( 544K)]
node #1284 (FLASH_ATTN): node_1284 ( 16K) [ CPU ]: CPU#q-35#0 ( 16K) [ NULL ] CPU#k-35#0 ( 544K) [ NULL ] CPU#v-35#0 ( 544K) [ NULL ] CPU#KQ_mask (copy)#0 ( 32K) [ NULL ]
SPLIT #73: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1286 ( MUL_MAT): kqv_out-35 ( 15K) [CUDA0 ]: blk.35.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1287 ( RMS_NORM): norm-35 ( 15K) [CUDA0 ]: kqv_out-35 ( 15K) [CUDA0 ] node #1288 ( MUL): attn_post_norm-35 ( 15K) [CUDA0 ]: norm-35 ( 15K) [CUDA0 ] blk.35.post_attentio ( 15K) [CUDA0 ] node #1289 ( ADD): sa_out-35 ( 15K) [CUDA0 ]: attn_post_norm-35 ( 15K) [CUDA0 ] l_out-34 ( 15K) [CUDA0 ] node #1290 ( RMS_NORM): norm-35 ( 15K) [CUDA0 ]: sa_out-35 ( 15K) [CUDA0 ] node #1291 ( MUL): ffn_norm-35 ( 15K) [CUDA0 ]: norm-35 ( 15K) [CUDA0 ] blk.35.ffn_norm.weig ( 15K) [CUDA0 ] node #1292 ( MUL_MAT): ffn_gate-35 ( 60K) [CUDA0 ]: blk.35.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-35 ( 15K) [CUDA0 ] node #1293 ( UNARY): ffn_gelu-35 ( 60K) [CUDA0 ]: ffn_gate-35 ( 60K) [CUDA0 ] node #1294 ( MUL_MAT): ffn_up-35 ( 60K) [CUDA0 ]: blk.35.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-35 ( 15K) [CUDA0 ] node #1295 ( MUL): ffn_gate_par-35 ( 60K) [CUDA0 ]: ffn_gelu-35 ( 60K) [CUDA0 ] ffn_up-35 ( 60K) [CUDA0 ] node #1296 ( MUL_MAT): ffn_out-35 ( 15K) [CUDA0 ]: blk.35.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-35 ( 60K) [CUDA0 ] node #1297 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-35 ( 15K) [CUDA0 ] node #1298 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.35.post_ffw_norm ( 15K) [CUDA0 ] node #1299 ( ADD): l_out-35 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-35 ( 15K) [CUDA0 ] node #1300 ( RMS_NORM): norm-36 ( 15K) [CUDA0 ]: l_out-35 ( 15K) [CUDA0 ] node #1301 ( MUL): attn_norm-36 ( 15K) [CUDA0 ]: norm-36 ( 15K) [CUDA0 ] blk.36.attn_norm.wei ( 15K) [CUDA0 ] node #1302 ( MUL_MAT): Qcur-36 ( 16K) [CUDA0 ]: blk.36.attn_q.weight ( 8M) [CUDA0 ] attn_norm-36 ( 15K) [CUDA0 ] node #1304 ( RMS_NORM): norm-36 ( 16K) [CUDA0 ]: Qcur-36 (reshaped) ( 16K) [CUDA0 ] node #1305 ( MUL): Qcur_normed-36 ( 16K) [CUDA0 ]: norm-36 ( 16K) [CUDA0 ] blk.36.attn_q_norm.w ( 1K) [CUDA0 ] node #1306 ( ROPE): Qcur-36 ( 16K) [CUDA0 ]: Qcur_normed-36 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1307 ( MUL_MAT): Kcur-36 ( 8K) [CUDA0 ]: blk.36.attn_k.weight ( 4M) [CUDA0 ] attn_norm-36 ( 15K) [CUDA0 ] node #1309 ( RMS_NORM): norm-36 ( 8K) [CUDA0 ]: Kcur-36 (reshaped) ( 8K) [CUDA0 ] node #1310 ( MUL): Kcur_normed-36 ( 8K) [CUDA0 ]: norm-36 ( 8K) [CUDA0 ] blk.36.attn_k_norm.w ( 1K) [CUDA0 ] node #1311 ( ROPE): Kcur-36 ( 8K) [CUDA0 ]: Kcur_normed-36 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1312 ( MUL_MAT): Vcur-36 ( 8K) [CUDA0 ]: blk.36.attn_v.weight ( 4M) [CUDA0 ] attn_norm-36 ( 15K) [CUDA0 ] node #1314 ( CPY): k_cache_view-36 (cop ( 2K) [CUDA0 ]: Kcur-36 ( 8K) [CUDA0 ] k_cache_view-36 ( 2K) [CUDA0 ] node #1316 ( CPY): v_cache_view-36 (cop ( 2K) [CUDA0 ]: Vcur-36 ( 8K) [CUDA0 ] v_cache_view-36 ( 2K) [CUDA0 ]
SPLIT #74: CPU # 3 inputs: [q-36 ( 16K)] [k-36 ( 544K)] [v-36 ( 544K)]
node #1320 (FLASH_ATTN): node_1320 ( 16K) [ CPU ]: CPU#q-36#0 ( 16K) [ NULL ] CPU#k-36#0 ( 544K) [ NULL ] CPU#v-36#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #75: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1322 ( MUL_MAT): kqv_out-36 ( 15K) [CUDA0 ]: blk.36.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1323 ( RMS_NORM): norm-36 ( 15K) [CUDA0 ]: kqv_out-36 ( 15K) [CUDA0 ] node #1324 ( MUL): attn_post_norm-36 ( 15K) [CUDA0 ]: norm-36 ( 15K) [CUDA0 ] blk.36.post_attentio ( 15K) [CUDA0 ] node #1325 ( ADD): sa_out-36 ( 15K) [CUDA0 ]: attn_post_norm-36 ( 15K) [CUDA0 ] l_out-35 ( 15K) [CUDA0 ] node #1326 ( RMS_NORM): norm-36 ( 15K) [CUDA0 ]: sa_out-36 ( 15K) [CUDA0 ] node #1327 ( MUL): ffn_norm-36 ( 15K) [CUDA0 ]: norm-36 ( 15K) [CUDA0 ] blk.36.ffn_norm.weig ( 15K) [CUDA0 ] node #1328 ( MUL_MAT): ffn_gate-36 ( 60K) [CUDA0 ]: blk.36.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-36 ( 15K) [CUDA0 ] node #1329 ( UNARY): ffn_gelu-36 ( 60K) [CUDA0 ]: ffn_gate-36 ( 60K) [CUDA0 ] node #1330 ( MUL_MAT): ffn_up-36 ( 60K) [CUDA0 ]: blk.36.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-36 ( 15K) [CUDA0 ] node #1331 ( MUL): ffn_gate_par-36 ( 60K) [CUDA0 ]: ffn_gelu-36 ( 60K) [CUDA0 ] ffn_up-36 ( 60K) [CUDA0 ] node #1332 ( MUL_MAT): ffn_out-36 ( 15K) [CUDA0 ]: blk.36.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-36 ( 60K) [CUDA0 ] node #1333 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-36 ( 15K) [CUDA0 ] node #1334 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.36.post_ffw_norm ( 15K) [CUDA0 ] node #1335 ( ADD): l_out-36 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-36 ( 15K) [CUDA0 ] node #1336 ( RMS_NORM): norm-37 ( 15K) [CUDA0 ]: l_out-36 ( 15K) [CUDA0 ] node #1337 ( MUL): attn_norm-37 ( 15K) [CUDA0 ]: norm-37 ( 15K) [CUDA0 ] blk.37.attn_norm.wei ( 15K) [CUDA0 ] node #1338 ( MUL_MAT): Qcur-37 ( 16K) [CUDA0 ]: blk.37.attn_q.weight ( 8M) [CUDA0 ] attn_norm-37 ( 15K) [CUDA0 ] node #1340 ( RMS_NORM): norm-37 ( 16K) [CUDA0 ]: Qcur-37 (reshaped) ( 16K) [CUDA0 ] node #1341 ( MUL): Qcur_normed-37 ( 16K) [CUDA0 ]: norm-37 ( 16K) [CUDA0 ] blk.37.attn_q_norm.w ( 1K) [CUDA0 ] node #1342 ( ROPE): Qcur-37 ( 16K) [CUDA0 ]: Qcur_normed-37 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1343 ( MUL_MAT): Kcur-37 ( 8K) [CUDA0 ]: blk.37.attn_k.weight ( 4M) [CUDA0 ] attn_norm-37 ( 15K) [CUDA0 ] node #1345 ( RMS_NORM): norm-37 ( 8K) [CUDA0 ]: Kcur-37 (reshaped) ( 8K) [CUDA0 ] node #1346 ( MUL): Kcur_normed-37 ( 8K) [CUDA0 ]: norm-37 ( 8K) [CUDA0 ] blk.37.attn_k_norm.w ( 1K) [CUDA0 ] node #1347 ( ROPE): Kcur-37 ( 8K) [CUDA0 ]: Kcur_normed-37 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1348 ( MUL_MAT): Vcur-37 ( 8K) [CUDA0 ]: blk.37.attn_v.weight ( 4M) [CUDA0 ] attn_norm-37 ( 15K) [CUDA0 ] node #1350 ( CPY): k_cache_view-37 (cop ( 2K) [CUDA0 ]: Kcur-37 ( 8K) [CUDA0 ] k_cache_view-37 ( 2K) [CUDA0 ] node #1352 ( CPY): v_cache_view-37 (cop ( 2K) [CUDA0 ]: Vcur-37 ( 8K) [CUDA0 ] v_cache_view-37 ( 2K) [CUDA0 ]
SPLIT #76: CPU # 3 inputs: [q-37 ( 16K)] [k-37 ( 544K)] [v-37 ( 544K)]
node #1356 (FLASH_ATTN): node_1356 ( 16K) [ CPU ]: CPU#q-37#0 ( 16K) [ NULL ] CPU#k-37#0 ( 544K) [ NULL ] CPU#v-37#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #77: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1358 ( MUL_MAT): kqv_out-37 ( 15K) [CUDA0 ]: blk.37.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1359 ( RMS_NORM): norm-37 ( 15K) [CUDA0 ]: kqv_out-37 ( 15K) [CUDA0 ] node #1360 ( MUL): attn_post_norm-37 ( 15K) [CUDA0 ]: norm-37 ( 15K) [CUDA0 ] blk.37.post_attentio ( 15K) [CUDA0 ] node #1361 ( ADD): sa_out-37 ( 15K) [CUDA0 ]: attn_post_norm-37 ( 15K) [CUDA0 ] l_out-36 ( 15K) [CUDA0 ] node #1362 ( RMS_NORM): norm-37 ( 15K) [CUDA0 ]: sa_out-37 ( 15K) [CUDA0 ] node #1363 ( MUL): ffn_norm-37 ( 15K) [CUDA0 ]: norm-37 ( 15K) [CUDA0 ] blk.37.ffn_norm.weig ( 15K) [CUDA0 ] node #1364 ( MUL_MAT): ffn_gate-37 ( 60K) [CUDA0 ]: blk.37.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-37 ( 15K) [CUDA0 ] node #1365 ( UNARY): ffn_gelu-37 ( 60K) [CUDA0 ]: ffn_gate-37 ( 60K) [CUDA0 ] node #1366 ( MUL_MAT): ffn_up-37 ( 60K) [CUDA0 ]: blk.37.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-37 ( 15K) [CUDA0 ] node #1367 ( MUL): ffn_gate_par-37 ( 60K) [CUDA0 ]: ffn_gelu-37 ( 60K) [CUDA0 ] ffn_up-37 ( 60K) [CUDA0 ] node #1368 ( MUL_MAT): ffn_out-37 ( 15K) [CUDA0 ]: blk.37.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-37 ( 60K) [CUDA0 ] node #1369 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-37 ( 15K) [CUDA0 ] node #1370 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.37.post_ffw_norm ( 15K) [CUDA0 ] node #1371 ( ADD): l_out-37 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-37 ( 15K) [CUDA0 ] node #1372 ( RMS_NORM): norm-38 ( 15K) [CUDA0 ]: l_out-37 ( 15K) [CUDA0 ] node #1373 ( MUL): attn_norm-38 ( 15K) [CUDA0 ]: norm-38 ( 15K) [CUDA0 ] blk.38.attn_norm.wei ( 15K) [CUDA0 ] node #1374 ( MUL_MAT): Qcur-38 ( 16K) [CUDA0 ]: blk.38.attn_q.weight ( 8M) [CUDA0 ] attn_norm-38 ( 15K) [CUDA0 ] node #1376 ( RMS_NORM): norm-38 ( 16K) [CUDA0 ]: Qcur-38 (reshaped) ( 16K) [CUDA0 ] node #1377 ( MUL): Qcur_normed-38 ( 16K) [CUDA0 ]: norm-38 ( 16K) [CUDA0 ] blk.38.attn_q_norm.w ( 1K) [CUDA0 ] node #1378 ( ROPE): Qcur-38 ( 16K) [CUDA0 ]: Qcur_normed-38 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1379 ( MUL_MAT): Kcur-38 ( 8K) [CUDA0 ]: blk.38.attn_k.weight ( 4M) [CUDA0 ] attn_norm-38 ( 15K) [CUDA0 ] node #1381 ( RMS_NORM): norm-38 ( 8K) [CUDA0 ]: Kcur-38 (reshaped) ( 8K) [CUDA0 ] node #1382 ( MUL): Kcur_normed-38 ( 8K) [CUDA0 ]: norm-38 ( 8K) [CUDA0 ] blk.38.attn_k_norm.w ( 1K) [CUDA0 ] node #1383 ( ROPE): Kcur-38 ( 8K) [CUDA0 ]: Kcur_normed-38 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1384 ( MUL_MAT): Vcur-38 ( 8K) [CUDA0 ]: blk.38.attn_v.weight ( 6M) [CUDA0 ] attn_norm-38 ( 15K) [CUDA0 ] node #1386 ( CPY): k_cache_view-38 (cop ( 2K) [CUDA0 ]: Kcur-38 ( 8K) [CUDA0 ] k_cache_view-38 ( 2K) [CUDA0 ] node #1388 ( CPY): v_cache_view-38 (cop ( 2K) [CUDA0 ]: Vcur-38 ( 8K) [CUDA0 ] v_cache_view-38 ( 2K) [CUDA0 ]
SPLIT #78: CPU # 3 inputs: [q-38 ( 16K)] [k-38 ( 544K)] [v-38 ( 544K)]
node #1392 (FLASH_ATTN): node_1392 ( 16K) [ CPU ]: CPU#q-38#0 ( 16K) [ NULL ] CPU#k-38#0 ( 544K) [ NULL ] CPU#v-38#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #79: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1394 ( MUL_MAT): kqv_out-38 ( 15K) [CUDA0 ]: blk.38.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1395 ( RMS_NORM): norm-38 ( 15K) [CUDA0 ]: kqv_out-38 ( 15K) [CUDA0 ] node #1396 ( MUL): attn_post_norm-38 ( 15K) [CUDA0 ]: norm-38 ( 15K) [CUDA0 ] blk.38.post_attentio ( 15K) [CUDA0 ] node #1397 ( ADD): sa_out-38 ( 15K) [CUDA0 ]: attn_post_norm-38 ( 15K) [CUDA0 ] l_out-37 ( 15K) [CUDA0 ] node #1398 ( RMS_NORM): norm-38 ( 15K) [CUDA0 ]: sa_out-38 ( 15K) [CUDA0 ] node #1399 ( MUL): ffn_norm-38 ( 15K) [CUDA0 ]: norm-38 ( 15K) [CUDA0 ] blk.38.ffn_norm.weig ( 15K) [CUDA0 ] node #1400 ( MUL_MAT): ffn_gate-38 ( 60K) [CUDA0 ]: blk.38.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-38 ( 15K) [CUDA0 ] node #1401 ( UNARY): ffn_gelu-38 ( 60K) [CUDA0 ]: ffn_gate-38 ( 60K) [CUDA0 ] node #1402 ( MUL_MAT): ffn_up-38 ( 60K) [CUDA0 ]: blk.38.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-38 ( 15K) [CUDA0 ] node #1403 ( MUL): ffn_gate_par-38 ( 60K) [CUDA0 ]: ffn_gelu-38 ( 60K) [CUDA0 ] ffn_up-38 ( 60K) [CUDA0 ] node #1404 ( MUL_MAT): ffn_out-38 ( 15K) [CUDA0 ]: blk.38.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-38 ( 60K) [CUDA0 ] node #1405 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-38 ( 15K) [CUDA0 ] node #1406 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.38.post_ffw_norm ( 15K) [CUDA0 ] node #1407 ( ADD): l_out-38 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-38 ( 15K) [CUDA0 ] node #1408 ( RMS_NORM): norm-39 ( 15K) [CUDA0 ]: l_out-38 ( 15K) [CUDA0 ] node #1409 ( MUL): attn_norm-39 ( 15K) [CUDA0 ]: norm-39 ( 15K) [CUDA0 ] blk.39.attn_norm.wei ( 15K) [CUDA0 ] node #1410 ( MUL_MAT): Qcur-39 ( 16K) [CUDA0 ]: blk.39.attn_q.weight ( 8M) [CUDA0 ] attn_norm-39 ( 15K) [CUDA0 ] node #1412 ( RMS_NORM): norm-39 ( 16K) [CUDA0 ]: Qcur-39 (reshaped) ( 16K) [CUDA0 ] node #1413 ( MUL): Qcur_normed-39 ( 16K) [CUDA0 ]: norm-39 ( 16K) [CUDA0 ] blk.39.attn_q_norm.w ( 1K) [CUDA0 ] node #1414 ( ROPE): Qcur-39 ( 16K) [CUDA0 ]: Qcur_normed-39 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1415 ( MUL_MAT): Kcur-39 ( 8K) [CUDA0 ]: blk.39.attn_k.weight ( 4M) [CUDA0 ] attn_norm-39 ( 15K) [CUDA0 ] node #1417 ( RMS_NORM): norm-39 ( 8K) [CUDA0 ]: Kcur-39 (reshaped) ( 8K) [CUDA0 ] node #1418 ( MUL): Kcur_normed-39 ( 8K) [CUDA0 ]: norm-39 ( 8K) [CUDA0 ] blk.39.attn_k_norm.w ( 1K) [CUDA0 ] node #1419 ( ROPE): Kcur-39 ( 8K) [CUDA0 ]: Kcur_normed-39 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1420 ( MUL_MAT): Vcur-39 ( 8K) [CUDA0 ]: blk.39.attn_v.weight ( 4M) [CUDA0 ] attn_norm-39 ( 15K) [CUDA0 ] node #1422 ( CPY): k_cache_view-39 (cop ( 2K) [CUDA0 ]: Kcur-39 ( 8K) [CUDA0 ] k_cache_view-39 ( 2K) [CUDA0 ] node #1424 ( CPY): v_cache_view-39 (cop ( 2K) [CUDA0 ]: Vcur-39 ( 8K) [CUDA0 ] v_cache_view-39 ( 2K) [CUDA0 ]
SPLIT #80: CPU # 3 inputs: [q-39 ( 16K)] [k-39 ( 544K)] [v-39 ( 544K)]
node #1428 (FLASH_ATTN): node_1428 ( 16K) [ CPU ]: CPU#q-39#0 ( 16K) [ NULL ] CPU#k-39#0 ( 544K) [ NULL ] CPU#v-39#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #81: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1430 ( MUL_MAT): kqv_out-39 ( 15K) [CUDA0 ]: blk.39.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1431 ( RMS_NORM): norm-39 ( 15K) [CUDA0 ]: kqv_out-39 ( 15K) [CUDA0 ] node #1432 ( MUL): attn_post_norm-39 ( 15K) [CUDA0 ]: norm-39 ( 15K) [CUDA0 ] blk.39.post_attentio ( 15K) [CUDA0 ] node #1433 ( ADD): sa_out-39 ( 15K) [CUDA0 ]: attn_post_norm-39 ( 15K) [CUDA0 ] l_out-38 ( 15K) [CUDA0 ] node #1434 ( RMS_NORM): norm-39 ( 15K) [CUDA0 ]: sa_out-39 ( 15K) [CUDA0 ] node #1435 ( MUL): ffn_norm-39 ( 15K) [CUDA0 ]: norm-39 ( 15K) [CUDA0 ] blk.39.ffn_norm.weig ( 15K) [CUDA0 ] node #1436 ( MUL_MAT): ffn_gate-39 ( 60K) [CUDA0 ]: blk.39.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-39 ( 15K) [CUDA0 ] node #1437 ( UNARY): ffn_gelu-39 ( 60K) [CUDA0 ]: ffn_gate-39 ( 60K) [CUDA0 ] node #1438 ( MUL_MAT): ffn_up-39 ( 60K) [CUDA0 ]: blk.39.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-39 ( 15K) [CUDA0 ] node #1439 ( MUL): ffn_gate_par-39 ( 60K) [CUDA0 ]: ffn_gelu-39 ( 60K) [CUDA0 ] ffn_up-39 ( 60K) [CUDA0 ] node #1440 ( MUL_MAT): ffn_out-39 ( 15K) [CUDA0 ]: blk.39.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-39 ( 60K) [CUDA0 ] node #1441 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-39 ( 15K) [CUDA0 ] node #1442 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.39.post_ffw_norm ( 15K) [CUDA0 ] node #1443 ( ADD): l_out-39 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-39 ( 15K) [CUDA0 ] node #1444 ( RMS_NORM): norm-40 ( 15K) [CUDA0 ]: l_out-39 ( 15K) [CUDA0 ] node #1445 ( MUL): attn_norm-40 ( 15K) [CUDA0 ]: norm-40 ( 15K) [CUDA0 ] blk.40.attn_norm.wei ( 15K) [CUDA0 ] node #1446 ( MUL_MAT): Qcur-40 ( 16K) [CUDA0 ]: blk.40.attn_q.weight ( 8M) [CUDA0 ] attn_norm-40 ( 15K) [CUDA0 ] node #1448 ( RMS_NORM): norm-40 ( 16K) [CUDA0 ]: Qcur-40 (reshaped) ( 16K) [CUDA0 ] node #1449 ( MUL): Qcur_normed-40 ( 16K) [CUDA0 ]: norm-40 ( 16K) [CUDA0 ] blk.40.attn_q_norm.w ( 1K) [CUDA0 ] node #1450 ( ROPE): Qcur-40 ( 16K) [CUDA0 ]: Qcur_normed-40 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1451 ( MUL_MAT): Kcur-40 ( 8K) [CUDA0 ]: blk.40.attn_k.weight ( 4M) [CUDA0 ] attn_norm-40 ( 15K) [CUDA0 ] node #1453 ( RMS_NORM): norm-40 ( 8K) [CUDA0 ]: Kcur-40 (reshaped) ( 8K) [CUDA0 ] node #1454 ( MUL): Kcur_normed-40 ( 8K) [CUDA0 ]: norm-40 ( 8K) [CUDA0 ] blk.40.attn_k_norm.w ( 1K) [CUDA0 ] node #1455 ( ROPE): Kcur-40 ( 8K) [CUDA0 ]: Kcur_normed-40 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1456 ( MUL_MAT): Vcur-40 ( 8K) [CUDA0 ]: blk.40.attn_v.weight ( 4M) [CUDA0 ] attn_norm-40 ( 15K) [CUDA0 ] node #1458 ( CPY): k_cache_view-40 (cop ( 2K) [CUDA0 ]: Kcur-40 ( 8K) [CUDA0 ] k_cache_view-40 ( 2K) [CUDA0 ] node #1460 ( CPY): v_cache_view-40 (cop ( 2K) [CUDA0 ]: Vcur-40 ( 8K) [CUDA0 ] v_cache_view-40 ( 2K) [CUDA0 ]
SPLIT #82: CPU # 3 inputs: [q-40 ( 16K)] [k-40 ( 544K)] [v-40 ( 544K)]
node #1464 (FLASH_ATTN): node_1464 ( 16K) [ CPU ]: CPU#q-40#0 ( 16K) [ NULL ] CPU#k-40#0 ( 544K) [ NULL ] CPU#v-40#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #83: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1466 ( MUL_MAT): kqv_out-40 ( 15K) [CUDA0 ]: blk.40.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1467 ( RMS_NORM): norm-40 ( 15K) [CUDA0 ]: kqv_out-40 ( 15K) [CUDA0 ] node #1468 ( MUL): attn_post_norm-40 ( 15K) [CUDA0 ]: norm-40 ( 15K) [CUDA0 ] blk.40.post_attentio ( 15K) [CUDA0 ] node #1469 ( ADD): sa_out-40 ( 15K) [CUDA0 ]: attn_post_norm-40 ( 15K) [CUDA0 ] l_out-39 ( 15K) [CUDA0 ] node #1470 ( RMS_NORM): norm-40 ( 15K) [CUDA0 ]: sa_out-40 ( 15K) [CUDA0 ] node #1471 ( MUL): ffn_norm-40 ( 15K) [CUDA0 ]: norm-40 ( 15K) [CUDA0 ] blk.40.ffn_norm.weig ( 15K) [CUDA0 ] node #1472 ( MUL_MAT): ffn_gate-40 ( 60K) [CUDA0 ]: blk.40.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-40 ( 15K) [CUDA0 ] node #1473 ( UNARY): ffn_gelu-40 ( 60K) [CUDA0 ]: ffn_gate-40 ( 60K) [CUDA0 ] node #1474 ( MUL_MAT): ffn_up-40 ( 60K) [CUDA0 ]: blk.40.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-40 ( 15K) [CUDA0 ] node #1475 ( MUL): ffn_gate_par-40 ( 60K) [CUDA0 ]: ffn_gelu-40 ( 60K) [CUDA0 ] ffn_up-40 ( 60K) [CUDA0 ] node #1476 ( MUL_MAT): ffn_out-40 ( 15K) [CUDA0 ]: blk.40.ffn_down.weig ( 31M) [CUDA0 ] ffn_gate_par-40 ( 60K) [CUDA0 ] node #1477 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-40 ( 15K) [CUDA0 ] node #1478 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.40.post_ffw_norm ( 15K) [CUDA0 ] node #1479 ( ADD): l_out-40 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-40 ( 15K) [CUDA0 ] node #1480 ( RMS_NORM): norm-41 ( 15K) [CUDA0 ]: l_out-40 ( 15K) [CUDA0 ] node #1481 ( MUL): attn_norm-41 ( 15K) [CUDA0 ]: norm-41 ( 15K) [CUDA0 ] blk.41.attn_norm.wei ( 15K) [CUDA0 ] node #1482 ( MUL_MAT): Qcur-41 ( 16K) [CUDA0 ]: blk.41.attn_q.weight ( 8M) [CUDA0 ] attn_norm-41 ( 15K) [CUDA0 ] node #1484 ( RMS_NORM): norm-41 ( 16K) [CUDA0 ]: Qcur-41 (reshaped) ( 16K) [CUDA0 ] node #1485 ( MUL): Qcur_normed-41 ( 16K) [CUDA0 ]: norm-41 ( 16K) [CUDA0 ] blk.41.attn_q_norm.w ( 1K) [CUDA0 ] node #1486 ( ROPE): Qcur-41 ( 16K) [CUDA0 ]: Qcur_normed-41 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1487 ( MUL_MAT): Kcur-41 ( 8K) [CUDA0 ]: blk.41.attn_k.weight ( 4M) [CUDA0 ] attn_norm-41 ( 15K) [CUDA0 ] node #1489 ( RMS_NORM): norm-41 ( 8K) [CUDA0 ]: Kcur-41 (reshaped) ( 8K) [CUDA0 ] node #1490 ( MUL): Kcur_normed-41 ( 8K) [CUDA0 ]: norm-41 ( 8K) [CUDA0 ] blk.41.attn_k_norm.w ( 1K) [CUDA0 ] node #1491 ( ROPE): Kcur-41 ( 8K) [CUDA0 ]: Kcur_normed-41 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1492 ( MUL_MAT): Vcur-41 ( 8K) [CUDA0 ]: blk.41.attn_v.weight ( 6M) [CUDA0 ] attn_norm-41 ( 15K) [CUDA0 ] node #1494 ( CPY): k_cache_view-41 (cop ( 2K) [CUDA0 ]: Kcur-41 ( 8K) [CUDA0 ] k_cache_view-41 ( 2K) [CUDA0 ] node #1496 ( CPY): v_cache_view-41 (cop ( 2K) [CUDA0 ]: Vcur-41 ( 8K) [CUDA0 ] v_cache_view-41 ( 2K) [CUDA0 ]
SPLIT #84: CPU # 3 inputs: [q-41 ( 16K)] [k-41 ( 544K)] [v-41 ( 544K)]
node #1500 (FLASH_ATTN): node_1500 ( 16K) [ CPU ]: CPU#q-41#0 ( 16K) [ NULL ] CPU#k-41#0 ( 544K) [ NULL ] CPU#v-41#0 ( 544K) [ NULL ] CPU#KQ_mask (copy)#0 ( 32K) [ NULL ]
SPLIT #85: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1502 ( MUL_MAT): kqv_out-41 ( 15K) [CUDA0 ]: blk.41.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1503 ( RMS_NORM): norm-41 ( 15K) [CUDA0 ]: kqv_out-41 ( 15K) [CUDA0 ] node #1504 ( MUL): attn_post_norm-41 ( 15K) [CUDA0 ]: norm-41 ( 15K) [CUDA0 ] blk.41.post_attentio ( 15K) [CUDA0 ] node #1505 ( ADD): sa_out-41 ( 15K) [CUDA0 ]: attn_post_norm-41 ( 15K) [CUDA0 ] l_out-40 ( 15K) [CUDA0 ] node #1506 ( RMS_NORM): norm-41 ( 15K) [CUDA0 ]: sa_out-41 ( 15K) [CUDA0 ] node #1507 ( MUL): ffn_norm-41 ( 15K) [CUDA0 ]: norm-41 ( 15K) [CUDA0 ] blk.41.ffn_norm.weig ( 15K) [CUDA0 ] node #1508 ( MUL_MAT): ffn_gate-41 ( 60K) [CUDA0 ]: blk.41.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-41 ( 15K) [CUDA0 ] node #1509 ( UNARY): ffn_gelu-41 ( 60K) [CUDA0 ]: ffn_gate-41 ( 60K) [CUDA0 ] node #1510 ( MUL_MAT): ffn_up-41 ( 60K) [CUDA0 ]: blk.41.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-41 ( 15K) [CUDA0 ] node #1511 ( MUL): ffn_gate_par-41 ( 60K) [CUDA0 ]: ffn_gelu-41 ( 60K) [CUDA0 ] ffn_up-41 ( 60K) [CUDA0 ] node #1512 ( MUL_MAT): ffn_out-41 ( 15K) [CUDA0 ]: blk.41.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-41 ( 60K) [CUDA0 ] node #1513 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-41 ( 15K) [CUDA0 ] node #1514 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.41.post_ffw_norm ( 15K) [CUDA0 ] node #1515 ( ADD): l_out-41 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-41 ( 15K) [CUDA0 ] node #1516 ( RMS_NORM): norm-42 ( 15K) [CUDA0 ]: l_out-41 ( 15K) [CUDA0 ] node #1517 ( MUL): attn_norm-42 ( 15K) [CUDA0 ]: norm-42 ( 15K) [CUDA0 ] blk.42.attn_norm.wei ( 15K) [CUDA0 ] node #1518 ( MUL_MAT): Qcur-42 ( 16K) [CUDA0 ]: blk.42.attn_q.weight ( 8M) [CUDA0 ] attn_norm-42 ( 15K) [CUDA0 ] node #1520 ( RMS_NORM): norm-42 ( 16K) [CUDA0 ]: Qcur-42 (reshaped) ( 16K) [CUDA0 ] node #1521 ( MUL): Qcur_normed-42 ( 16K) [CUDA0 ]: norm-42 ( 16K) [CUDA0 ] blk.42.attn_q_norm.w ( 1K) [CUDA0 ] node #1522 ( ROPE): Qcur-42 ( 16K) [CUDA0 ]: Qcur_normed-42 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1523 ( MUL_MAT): Kcur-42 ( 8K) [CUDA0 ]: blk.42.attn_k.weight ( 4M) [CUDA0 ] attn_norm-42 ( 15K) [CUDA0 ] node #1525 ( RMS_NORM): norm-42 ( 8K) [CUDA0 ]: Kcur-42 (reshaped) ( 8K) [CUDA0 ] node #1526 ( MUL): Kcur_normed-42 ( 8K) [CUDA0 ]: norm-42 ( 8K) [CUDA0 ] blk.42.attn_k_norm.w ( 1K) [CUDA0 ] node #1527 ( ROPE): Kcur-42 ( 8K) [CUDA0 ]: Kcur_normed-42 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1528 ( MUL_MAT): Vcur-42 ( 8K) [CUDA0 ]: blk.42.attn_v.weight ( 6M) [CUDA0 ] attn_norm-42 ( 15K) [CUDA0 ] node #1530 ( CPY): k_cache_view-42 (cop ( 2K) [CUDA0 ]: Kcur-42 ( 8K) [CUDA0 ] k_cache_view-42 ( 2K) [CUDA0 ] node #1532 ( CPY): v_cache_view-42 (cop ( 2K) [CUDA0 ]: Vcur-42 ( 8K) [CUDA0 ] v_cache_view-42 ( 2K) [CUDA0 ]
SPLIT #86: CPU # 3 inputs: [q-42 ( 16K)] [k-42 ( 544K)] [v-42 ( 544K)]
node #1536 (FLASH_ATTN): node_1536 ( 16K) [ CPU ]: CPU#q-42#0 ( 16K) [ NULL ] CPU#k-42#0 ( 544K) [ NULL ] CPU#v-42#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #87: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1538 ( MUL_MAT): kqv_out-42 ( 15K) [CUDA0 ]: blk.42.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1539 ( RMS_NORM): norm-42 ( 15K) [CUDA0 ]: kqv_out-42 ( 15K) [CUDA0 ] node #1540 ( MUL): attn_post_norm-42 ( 15K) [CUDA0 ]: norm-42 ( 15K) [CUDA0 ] blk.42.post_attentio ( 15K) [CUDA0 ] node #1541 ( ADD): sa_out-42 ( 15K) [CUDA0 ]: attn_post_norm-42 ( 15K) [CUDA0 ] l_out-41 ( 15K) [CUDA0 ] node #1542 ( RMS_NORM): norm-42 ( 15K) [CUDA0 ]: sa_out-42 ( 15K) [CUDA0 ] node #1543 ( MUL): ffn_norm-42 ( 15K) [CUDA0 ]: norm-42 ( 15K) [CUDA0 ] blk.42.ffn_norm.weig ( 15K) [CUDA0 ] node #1544 ( MUL_MAT): ffn_gate-42 ( 60K) [CUDA0 ]: blk.42.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-42 ( 15K) [CUDA0 ] node #1545 ( UNARY): ffn_gelu-42 ( 60K) [CUDA0 ]: ffn_gate-42 ( 60K) [CUDA0 ] node #1546 ( MUL_MAT): ffn_up-42 ( 60K) [CUDA0 ]: blk.42.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-42 ( 15K) [CUDA0 ] node #1547 ( MUL): ffn_gate_par-42 ( 60K) [CUDA0 ]: ffn_gelu-42 ( 60K) [CUDA0 ] ffn_up-42 ( 60K) [CUDA0 ] node #1548 ( MUL_MAT): ffn_out-42 ( 15K) [CUDA0 ]: blk.42.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-42 ( 60K) [CUDA0 ] node #1549 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-42 ( 15K) [CUDA0 ] node #1550 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.42.post_ffw_norm ( 15K) [CUDA0 ] node #1551 ( ADD): l_out-42 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-42 ( 15K) [CUDA0 ] node #1552 ( RMS_NORM): norm-43 ( 15K) [CUDA0 ]: l_out-42 ( 15K) [CUDA0 ] node #1553 ( MUL): attn_norm-43 ( 15K) [CUDA0 ]: norm-43 ( 15K) [CUDA0 ] blk.43.attn_norm.wei ( 15K) [CUDA0 ] node #1554 ( MUL_MAT): Qcur-43 ( 16K) [CUDA0 ]: blk.43.attn_q.weight ( 8M) [CUDA0 ] attn_norm-43 ( 15K) [CUDA0 ] node #1556 ( RMS_NORM): norm-43 ( 16K) [CUDA0 ]: Qcur-43 (reshaped) ( 16K) [CUDA0 ] node #1557 ( MUL): Qcur_normed-43 ( 16K) [CUDA0 ]: norm-43 ( 16K) [CUDA0 ] blk.43.attn_q_norm.w ( 1K) [CUDA0 ] node #1558 ( ROPE): Qcur-43 ( 16K) [CUDA0 ]: Qcur_normed-43 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1559 ( MUL_MAT): Kcur-43 ( 8K) [CUDA0 ]: blk.43.attn_k.weight ( 4M) [CUDA0 ] attn_norm-43 ( 15K) [CUDA0 ] node #1561 ( RMS_NORM): norm-43 ( 8K) [CUDA0 ]: Kcur-43 (reshaped) ( 8K) [CUDA0 ] node #1562 ( MUL): Kcur_normed-43 ( 8K) [CUDA0 ]: norm-43 ( 8K) [CUDA0 ] blk.43.attn_k_norm.w ( 1K) [CUDA0 ] node #1563 ( ROPE): Kcur-43 ( 8K) [CUDA0 ]: Kcur_normed-43 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1564 ( MUL_MAT): Vcur-43 ( 8K) [CUDA0 ]: blk.43.attn_v.weight ( 6M) [CUDA0 ] attn_norm-43 ( 15K) [CUDA0 ] node #1566 ( CPY): k_cache_view-43 (cop ( 2K) [CUDA0 ]: Kcur-43 ( 8K) [CUDA0 ] k_cache_view-43 ( 2K) [CUDA0 ] node #1568 ( CPY): v_cache_view-43 (cop ( 2K) [CUDA0 ]: Vcur-43 ( 8K) [CUDA0 ] v_cache_view-43 ( 2K) [CUDA0 ]
SPLIT #88: CPU # 3 inputs: [q-43 ( 16K)] [k-43 ( 544K)] [v-43 ( 544K)]
node #1572 (FLASH_ATTN): node_1572 ( 16K) [ CPU ]: CPU#q-43#0 ( 16K) [ NULL ] CPU#k-43#0 ( 544K) [ NULL ] CPU#v-43#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #89: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1574 ( MUL_MAT): kqv_out-43 ( 15K) [CUDA0 ]: blk.43.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1575 ( RMS_NORM): norm-43 ( 15K) [CUDA0 ]: kqv_out-43 ( 15K) [CUDA0 ] node #1576 ( MUL): attn_post_norm-43 ( 15K) [CUDA0 ]: norm-43 ( 15K) [CUDA0 ] blk.43.post_attentio ( 15K) [CUDA0 ] node #1577 ( ADD): sa_out-43 ( 15K) [CUDA0 ]: attn_post_norm-43 ( 15K) [CUDA0 ] l_out-42 ( 15K) [CUDA0 ] node #1578 ( RMS_NORM): norm-43 ( 15K) [CUDA0 ]: sa_out-43 ( 15K) [CUDA0 ] node #1579 ( MUL): ffn_norm-43 ( 15K) [CUDA0 ]: norm-43 ( 15K) [CUDA0 ] blk.43.ffn_norm.weig ( 15K) [CUDA0 ] node #1580 ( MUL_MAT): ffn_gate-43 ( 60K) [CUDA0 ]: blk.43.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-43 ( 15K) [CUDA0 ] node #1581 ( UNARY): ffn_gelu-43 ( 60K) [CUDA0 ]: ffn_gate-43 ( 60K) [CUDA0 ] node #1582 ( MUL_MAT): ffn_up-43 ( 60K) [CUDA0 ]: blk.43.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-43 ( 15K) [CUDA0 ] node #1583 ( MUL): ffn_gate_par-43 ( 60K) [CUDA0 ]: ffn_gelu-43 ( 60K) [CUDA0 ] ffn_up-43 ( 60K) [CUDA0 ] node #1584 ( MUL_MAT): ffn_out-43 ( 15K) [CUDA0 ]: blk.43.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-43 ( 60K) [CUDA0 ] node #1585 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-43 ( 15K) [CUDA0 ] node #1586 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.43.post_ffw_norm ( 15K) [CUDA0 ] node #1587 ( ADD): l_out-43 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-43 ( 15K) [CUDA0 ] node #1588 ( RMS_NORM): norm-44 ( 15K) [CUDA0 ]: l_out-43 ( 15K) [CUDA0 ] node #1589 ( MUL): attn_norm-44 ( 15K) [CUDA0 ]: norm-44 ( 15K) [CUDA0 ] blk.44.attn_norm.wei ( 15K) [CUDA0 ] node #1590 ( MUL_MAT): Qcur-44 ( 16K) [CUDA0 ]: blk.44.attn_q.weight ( 8M) [CUDA0 ] attn_norm-44 ( 15K) [CUDA0 ] node #1592 ( RMS_NORM): norm-44 ( 16K) [CUDA0 ]: Qcur-44 (reshaped) ( 16K) [CUDA0 ] node #1593 ( MUL): Qcur_normed-44 ( 16K) [CUDA0 ]: norm-44 ( 16K) [CUDA0 ] blk.44.attn_q_norm.w ( 1K) [CUDA0 ] node #1594 ( ROPE): Qcur-44 ( 16K) [CUDA0 ]: Qcur_normed-44 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1595 ( MUL_MAT): Kcur-44 ( 8K) [CUDA0 ]: blk.44.attn_k.weight ( 4M) [CUDA0 ] attn_norm-44 ( 15K) [CUDA0 ] node #1597 ( RMS_NORM): norm-44 ( 8K) [CUDA0 ]: Kcur-44 (reshaped) ( 8K) [CUDA0 ] node #1598 ( MUL): Kcur_normed-44 ( 8K) [CUDA0 ]: norm-44 ( 8K) [CUDA0 ] blk.44.attn_k_norm.w ( 1K) [CUDA0 ] node #1599 ( ROPE): Kcur-44 ( 8K) [CUDA0 ]: Kcur_normed-44 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1600 ( MUL_MAT): Vcur-44 ( 8K) [CUDA0 ]: blk.44.attn_v.weight ( 6M) [CUDA0 ] attn_norm-44 ( 15K) [CUDA0 ] node #1602 ( CPY): k_cache_view-44 (cop ( 2K) [CUDA0 ]: Kcur-44 ( 8K) [CUDA0 ] k_cache_view-44 ( 2K) [CUDA0 ] node #1604 ( CPY): v_cache_view-44 (cop ( 2K) [CUDA0 ]: Vcur-44 ( 8K) [CUDA0 ] v_cache_view-44 ( 2K) [CUDA0 ]
SPLIT #90: CPU # 3 inputs: [q-44 ( 16K)] [k-44 ( 544K)] [v-44 ( 544K)]
node #1608 (FLASH_ATTN): node_1608 ( 16K) [ CPU ]: CPU#q-44#0 ( 16K) [ NULL ] CPU#k-44#0 ( 544K) [ NULL ] CPU#v-44#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #91: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1610 ( MUL_MAT): kqv_out-44 ( 15K) [CUDA0 ]: blk.44.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1611 ( RMS_NORM): norm-44 ( 15K) [CUDA0 ]: kqv_out-44 ( 15K) [CUDA0 ] node #1612 ( MUL): attn_post_norm-44 ( 15K) [CUDA0 ]: norm-44 ( 15K) [CUDA0 ] blk.44.post_attentio ( 15K) [CUDA0 ] node #1613 ( ADD): sa_out-44 ( 15K) [CUDA0 ]: attn_post_norm-44 ( 15K) [CUDA0 ] l_out-43 ( 15K) [CUDA0 ] node #1614 ( RMS_NORM): norm-44 ( 15K) [CUDA0 ]: sa_out-44 ( 15K) [CUDA0 ] node #1615 ( MUL): ffn_norm-44 ( 15K) [CUDA0 ]: norm-44 ( 15K) [CUDA0 ] blk.44.ffn_norm.weig ( 15K) [CUDA0 ] node #1616 ( MUL_MAT): ffn_gate-44 ( 60K) [CUDA0 ]: blk.44.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-44 ( 15K) [CUDA0 ] node #1617 ( UNARY): ffn_gelu-44 ( 60K) [CUDA0 ]: ffn_gate-44 ( 60K) [CUDA0 ] node #1618 ( MUL_MAT): ffn_up-44 ( 60K) [CUDA0 ]: blk.44.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-44 ( 15K) [CUDA0 ] node #1619 ( MUL): ffn_gate_par-44 ( 60K) [CUDA0 ]: ffn_gelu-44 ( 60K) [CUDA0 ] ffn_up-44 ( 60K) [CUDA0 ] node #1620 ( MUL_MAT): ffn_out-44 ( 15K) [CUDA0 ]: blk.44.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-44 ( 60K) [CUDA0 ] node #1621 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-44 ( 15K) [CUDA0 ] node #1622 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.44.post_ffw_norm ( 15K) [CUDA0 ] node #1623 ( ADD): l_out-44 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-44 ( 15K) [CUDA0 ] node #1624 ( RMS_NORM): norm-45 ( 15K) [CUDA0 ]: l_out-44 ( 15K) [CUDA0 ] node #1625 ( MUL): attn_norm-45 ( 15K) [CUDA0 ]: norm-45 ( 15K) [CUDA0 ] blk.45.attn_norm.wei ( 15K) [CUDA0 ] node #1626 ( MUL_MAT): Qcur-45 ( 16K) [CUDA0 ]: blk.45.attn_q.weight ( 8M) [CUDA0 ] attn_norm-45 ( 15K) [CUDA0 ] node #1628 ( RMS_NORM): norm-45 ( 16K) [CUDA0 ]: Qcur-45 (reshaped) ( 16K) [CUDA0 ] node #1629 ( MUL): Qcur_normed-45 ( 16K) [CUDA0 ]: norm-45 ( 16K) [CUDA0 ] blk.45.attn_q_norm.w ( 1K) [CUDA0 ] node #1630 ( ROPE): Qcur-45 ( 16K) [CUDA0 ]: Qcur_normed-45 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1631 ( MUL_MAT): Kcur-45 ( 8K) [CUDA0 ]: blk.45.attn_k.weight ( 4M) [CUDA0 ] attn_norm-45 ( 15K) [CUDA0 ] node #1633 ( RMS_NORM): norm-45 ( 8K) [CUDA0 ]: Kcur-45 (reshaped) ( 8K) [CUDA0 ] node #1634 ( MUL): Kcur_normed-45 ( 8K) [CUDA0 ]: norm-45 ( 8K) [CUDA0 ] blk.45.attn_k_norm.w ( 1K) [CUDA0 ] node #1635 ( ROPE): Kcur-45 ( 8K) [CUDA0 ]: Kcur_normed-45 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1636 ( MUL_MAT): Vcur-45 ( 8K) [CUDA0 ]: blk.45.attn_v.weight ( 6M) [CUDA0 ] attn_norm-45 ( 15K) [CUDA0 ] node #1638 ( CPY): k_cache_view-45 (cop ( 2K) [CUDA0 ]: Kcur-45 ( 8K) [CUDA0 ] k_cache_view-45 ( 2K) [CUDA0 ] node #1640 ( CPY): v_cache_view-45 (cop ( 2K) [CUDA0 ]: Vcur-45 ( 8K) [CUDA0 ] v_cache_view-45 ( 2K) [CUDA0 ]
SPLIT #92: CPU # 3 inputs: [q-45 ( 16K)] [k-45 ( 544K)] [v-45 ( 544K)]
node #1644 (FLASH_ATTN): node_1644 ( 16K) [ CPU ]: CPU#q-45#0 ( 16K) [ NULL ] CPU#k-45#0 ( 544K) [ NULL ] CPU#v-45#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #93: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1646 ( MUL_MAT): kqv_out-45 ( 15K) [CUDA0 ]: blk.45.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1647 ( RMS_NORM): norm-45 ( 15K) [CUDA0 ]: kqv_out-45 ( 15K) [CUDA0 ] node #1648 ( MUL): attn_post_norm-45 ( 15K) [CUDA0 ]: norm-45 ( 15K) [CUDA0 ] blk.45.post_attentio ( 15K) [CUDA0 ] node #1649 ( ADD): sa_out-45 ( 15K) [CUDA0 ]: attn_post_norm-45 ( 15K) [CUDA0 ] l_out-44 ( 15K) [CUDA0 ] node #1650 ( RMS_NORM): norm-45 ( 15K) [CUDA0 ]: sa_out-45 ( 15K) [CUDA0 ] node #1651 ( MUL): ffn_norm-45 ( 15K) [CUDA0 ]: norm-45 ( 15K) [CUDA0 ] blk.45.ffn_norm.weig ( 15K) [CUDA0 ] node #1652 ( MUL_MAT): ffn_gate-45 ( 60K) [CUDA0 ]: blk.45.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-45 ( 15K) [CUDA0 ] node #1653 ( UNARY): ffn_gelu-45 ( 60K) [CUDA0 ]: ffn_gate-45 ( 60K) [CUDA0 ] node #1654 ( MUL_MAT): ffn_up-45 ( 60K) [CUDA0 ]: blk.45.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-45 ( 15K) [CUDA0 ] node #1655 ( MUL): ffn_gate_par-45 ( 60K) [CUDA0 ]: ffn_gelu-45 ( 60K) [CUDA0 ] ffn_up-45 ( 60K) [CUDA0 ] node #1656 ( MUL_MAT): ffn_out-45 ( 15K) [CUDA0 ]: blk.45.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-45 ( 60K) [CUDA0 ] node #1657 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-45 ( 15K) [CUDA0 ] node #1658 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.45.post_ffw_norm ( 15K) [CUDA0 ] node #1659 ( ADD): l_out-45 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-45 ( 15K) [CUDA0 ] node #1660 ( RMS_NORM): norm-46 ( 15K) [CUDA0 ]: l_out-45 ( 15K) [CUDA0 ] node #1661 ( MUL): attn_norm-46 ( 15K) [CUDA0 ]: norm-46 ( 15K) [CUDA0 ] blk.46.attn_norm.wei ( 15K) [CUDA0 ] node #1662 ( MUL_MAT): Qcur-46 ( 16K) [CUDA0 ]: blk.46.attn_q.weight ( 8M) [CUDA0 ] attn_norm-46 ( 15K) [CUDA0 ] node #1664 ( RMS_NORM): norm-46 ( 16K) [CUDA0 ]: Qcur-46 (reshaped) ( 16K) [CUDA0 ] node #1665 ( MUL): Qcur_normed-46 ( 16K) [CUDA0 ]: norm-46 ( 16K) [CUDA0 ] blk.46.attn_q_norm.w ( 1K) [CUDA0 ] node #1666 ( ROPE): Qcur-46 ( 16K) [CUDA0 ]: Qcur_normed-46 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1667 ( MUL_MAT): Kcur-46 ( 8K) [CUDA0 ]: blk.46.attn_k.weight ( 4M) [CUDA0 ] attn_norm-46 ( 15K) [CUDA0 ] node #1669 ( RMS_NORM): norm-46 ( 8K) [CUDA0 ]: Kcur-46 (reshaped) ( 8K) [CUDA0 ] node #1670 ( MUL): Kcur_normed-46 ( 8K) [CUDA0 ]: norm-46 ( 8K) [CUDA0 ] blk.46.attn_k_norm.w ( 1K) [CUDA0 ] node #1671 ( ROPE): Kcur-46 ( 8K) [CUDA0 ]: Kcur_normed-46 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1672 ( MUL_MAT): Vcur-46 ( 8K) [CUDA0 ]: blk.46.attn_v.weight ( 6M) [CUDA0 ] attn_norm-46 ( 15K) [CUDA0 ] node #1674 ( CPY): k_cache_view-46 (cop ( 2K) [CUDA0 ]: Kcur-46 ( 8K) [CUDA0 ] k_cache_view-46 ( 2K) [CUDA0 ] node #1676 ( CPY): v_cache_view-46 (cop ( 2K) [CUDA0 ]: Vcur-46 ( 8K) [CUDA0 ] v_cache_view-46 ( 2K) [CUDA0 ]
SPLIT #94: CPU # 3 inputs: [q-46 ( 16K)] [k-46 ( 544K)] [v-46 ( 544K)]
node #1680 (FLASH_ATTN): node_1680 ( 16K) [ CPU ]: CPU#q-46#0 ( 16K) [ NULL ] CPU#k-46#0 ( 544K) [ NULL ] CPU#v-46#0 ( 544K) [ NULL ] CPU#KQ_mask_swa (cop ( 32K) [ NULL ]
SPLIT #95: CUDA0 # 1 inputs: [ (reshaped) ( 16K)]
node #1682 ( MUL_MAT): kqv_out-46 ( 15K) [CUDA0 ]: blk.46.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1683 ( RMS_NORM): norm-46 ( 15K) [CUDA0 ]: kqv_out-46 ( 15K) [CUDA0 ] node #1684 ( MUL): attn_post_norm-46 ( 15K) [CUDA0 ]: norm-46 ( 15K) [CUDA0 ] blk.46.post_attentio ( 15K) [CUDA0 ] node #1685 ( ADD): sa_out-46 ( 15K) [CUDA0 ]: attn_post_norm-46 ( 15K) [CUDA0 ] l_out-45 ( 15K) [CUDA0 ] node #1686 ( RMS_NORM): norm-46 ( 15K) [CUDA0 ]: sa_out-46 ( 15K) [CUDA0 ] node #1687 ( MUL): ffn_norm-46 ( 15K) [CUDA0 ]: norm-46 ( 15K) [CUDA0 ] blk.46.ffn_norm.weig ( 15K) [CUDA0 ] node #1688 ( MUL_MAT): ffn_gate-46 ( 60K) [CUDA0 ]: blk.46.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-46 ( 15K) [CUDA0 ] node #1689 ( UNARY): ffn_gelu-46 ( 60K) [CUDA0 ]: ffn_gate-46 ( 60K) [CUDA0 ] node #1690 ( MUL_MAT): ffn_up-46 ( 60K) [CUDA0 ]: blk.46.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-46 ( 15K) [CUDA0 ] node #1691 ( MUL): ffn_gate_par-46 ( 60K) [CUDA0 ]: ffn_gelu-46 ( 60K) [CUDA0 ] ffn_up-46 ( 60K) [CUDA0 ] node #1692 ( MUL_MAT): ffn_out-46 ( 15K) [CUDA0 ]: blk.46.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-46 ( 60K) [CUDA0 ] node #1693 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-46 ( 15K) [CUDA0 ] node #1694 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.46.post_ffw_norm ( 15K) [CUDA0 ] node #1695 ( ADD): l_out-46 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-46 ( 15K) [CUDA0 ] node #1696 ( RMS_NORM): norm-47 ( 15K) [CUDA0 ]: l_out-46 ( 15K) [CUDA0 ] node #1697 ( MUL): attn_norm-47 ( 15K) [CUDA0 ]: norm-47 ( 15K) [CUDA0 ] blk.47.attn_norm.wei ( 15K) [CUDA0 ] node #1698 ( MUL_MAT): Qcur-47 ( 16K) [CUDA0 ]: blk.47.attn_q.weight ( 8M) [CUDA0 ] attn_norm-47 ( 15K) [CUDA0 ] node #1700 ( RMS_NORM): norm-47 ( 16K) [CUDA0 ]: Qcur-47 (reshaped) ( 16K) [CUDA0 ] node #1701 ( MUL): Qcur_normed-47 ( 16K) [CUDA0 ]: norm-47 ( 16K) [CUDA0 ] blk.47.attn_q_norm.w ( 1K) [CUDA0 ] node #1702 ( ROPE): Qcur-47 ( 16K) [CUDA0 ]: Qcur_normed-47 ( 16K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1703 ( MUL_MAT): Kcur-47 ( 8K) [CUDA0 ]: blk.47.attn_k.weight ( 4M) [CUDA0 ] attn_norm-47 ( 15K) [CUDA0 ] node #1705 ( RMS_NORM): norm-47 ( 8K) [CUDA0 ]: Kcur-47 (reshaped) ( 8K) [CUDA0 ] node #1706 ( MUL): Kcur_normed-47 ( 8K) [CUDA0 ]: norm-47 ( 8K) [CUDA0 ] blk.47.attn_k_norm.w ( 1K) [CUDA0 ] node #1707 ( ROPE): Kcur-47 ( 8K) [CUDA0 ]: Kcur_normed-47 ( 8K) [CUDA0 ] CUDA0#inp_pos#0 ( 0K) [ NULL ] node #1708 ( MUL_MAT): Vcur-47 ( 8K) [CUDA0 ]: blk.47.attn_v.weight ( 6M) [CUDA0 ] attn_norm-47 ( 15K) [CUDA0 ] node #1710 ( CPY): k_cache_view-47 (cop ( 2K) [CUDA0 ]: Kcur-47 ( 8K) [CUDA0 ] k_cache_view-47 ( 2K) [CUDA0 ] node #1712 ( CPY): v_cache_view-47 (cop ( 2K) [CUDA0 ]: Vcur-47 ( 8K) [CUDA0 ] v_cache_view-47 ( 2K) [CUDA0 ]
SPLIT #96: CPU # 3 inputs: [q-47 ( 16K)] [k-47 ( 544K)] [v-47 ( 544K)]
node #1716 (FLASH_ATTN): node_1716 ( 16K) [ CPU ]: CPU#q-47#0 ( 16K) [ NULL ] CPU#k-47#0 ( 544K) [ NULL ] CPU#v-47#0 ( 544K) [ NULL ] CPU#KQ_mask (copy)#0 ( 32K) [ NULL ]
SPLIT #97: CUDA0 # 2 inputs: [ (reshaped) ( 16K)] [inp_out_ids ( 0K)]
node #1718 ( MUL_MAT): kqv_out-47 ( 15K) [CUDA0 ]: blk.47.attn_output.w ( 8M) [CUDA0 ] CUDA0# (reshaped)#0 ( 16K) [ NULL ] node #1719 ( RMS_NORM): norm-47 ( 15K) [CUDA0 ]: kqv_out-47 ( 15K) [CUDA0 ] node #1720 ( MUL): attn_post_norm-47 ( 15K) [CUDA0 ]: norm-47 ( 15K) [CUDA0 ] blk.47.post_attentio ( 15K) [CUDA0 ] node #1721 ( GET_ROWS): node_1721 ( 15K) [CUDA0 ]: attn_post_norm-47 ( 15K) [CUDA0 ] CUDA0#inp_out_ids#0 ( 0K) [ NULL ] node #1722 ( GET_ROWS): node_1722 ( 15K) [CUDA0 ]: l_out-46 ( 15K) [CUDA0 ] CUDA0#inp_out_ids#0 ( 0K) [ NULL ] node #1723 ( ADD): sa_out-47 ( 15K) [CUDA0 ]: node_1721 ( 15K) [CUDA0 ] node_1722 ( 15K) [CUDA0 ] node #1724 ( RMS_NORM): norm-47 ( 15K) [CUDA0 ]: sa_out-47 ( 15K) [CUDA0 ] node #1725 ( MUL): ffn_norm-47 ( 15K) [CUDA0 ]: norm-47 ( 15K) [CUDA0 ] blk.47.ffn_norm.weig ( 15K) [CUDA0 ] node #1726 ( MUL_MAT): ffn_gate-47 ( 60K) [CUDA0 ]: blk.47.ffn_gate.weig ( 31M) [CUDA0 ] ffn_norm-47 ( 15K) [CUDA0 ] node #1727 ( UNARY): ffn_gelu-47 ( 60K) [CUDA0 ]: ffn_gate-47 ( 60K) [CUDA0 ] node #1728 ( MUL_MAT): ffn_up-47 ( 60K) [CUDA0 ]: blk.47.ffn_up.weight ( 31M) [CUDA0 ] ffn_norm-47 ( 15K) [CUDA0 ] node #1729 ( MUL): ffn_gate_par-47 ( 60K) [CUDA0 ]: ffn_gelu-47 ( 60K) [CUDA0 ] ffn_up-47 ( 60K) [CUDA0 ] node #1730 ( MUL_MAT): ffn_out-47 ( 15K) [CUDA0 ]: blk.47.ffn_down.weig ( 46M) [CUDA0 ] ffn_gate_par-47 ( 60K) [CUDA0 ] node #1731 ( RMS_NORM): norm ( 15K) [CUDA0 ]: ffn_out-47 ( 15K) [CUDA0 ] node #1732 ( MUL): ffn_post_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] blk.47.post_ffw_norm ( 15K) [CUDA0 ] node #1733 ( ADD): l_out-47 ( 15K) [CUDA0 ]: ffn_post_norm ( 15K) [CUDA0 ] sa_out-47 ( 15K) [CUDA0 ] node #1734 ( RMS_NORM): norm ( 15K) [CUDA0 ]: l_out-47 ( 15K) [CUDA0 ] node #1735 ( MUL): result_norm ( 15K) [CUDA0 ]: norm ( 15K) [CUDA0 ] output_norm.weight ( 15K) [CUDA0 ] node #1736 ( MUL_MAT): result_output ( 1M) [CUDA0 ]: token_embd.weight ( 787M) [CUDA0 ] result_norm ( 15K) [CUDA0 ] srv send: sending result for task id = 7 srv send: task id = 7 pushed to result queue slot process_toke: id 0 | task 7 | stopped by EOS slot process_toke: id 0 | task 7 | n_decoded = 22, n_remaining = -1, next token: 106 '' slot release: id 0 | task 7 | stop processing: n_past = 38, truncated = 0 slot print_timing: id 0 | task 7 | prompt eval time = 75.86 ms / 13 tokens ( 5.84 ms per token, 171.38 tokens per second) eval time = 945.03 ms / 22 tokens ( 42.96 ms per token, 23.28 tokens per second) total time = 1020.88 ms / 35 tokens srv send: sending result for task id = 7 srv send: task id = 7 pushed to result queue srv update_slots: run slots completed que start_loop: waiting for new tasks que start_loop: processing new tasks que start_loop: processing task, id = 29 que start_loop: update slots srv update_slots: all slots are idle que start_loop: waiting for new tasks data stream, to_send: data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":""}}],"created":1741785946,"id":"chatcmpl-9Qb2GeBPXWugyHyvaWal9u90NNwaReFo","model":"gpt-3.5-turbo","system_fingerprint":"b0-unknown","object":"chat.completion.chunk"}
data stream, to_send: data: {"choices":[{"finish_reason":"stop","index":0,"delta":{}}],"created":1741785946,"id":"chatcmpl-9Qb2GeBPXWugyHyvaWal9u90NNwaReFo","model":"gpt-3.5-turbo","system_fingerprint":"b0-unknown","object":"chat.completion.chunk","usage":{"completion_tokens":22,"prompt_tokens":17,"total_tokens":39},"timings":{"prompt_n":13,"prompt_ms":75.857,"prompt_per_token_ms":5.835153846153846,"prompt_per_second":171.37508733538104,"predicted_n":22,"predicted_ms":945.028,"predicted_per_token_ms":42.95581818181818,"predicted_per_second":23.279733510541487}}
srv log_server_r: request: POST /v1/chat/completions 127.0.0.1 200
srv log_server_r: request: {"messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Hello"}],"stream":true,"cache_prompt":true,"samplers":"edkypmxt","temperature":0.8,"dynatemp_range":0,"dynatemp_exponent":1,"top_k":40,"top_p":0.95,"min_p":0.05,"typical_p":1,"xtc_probability":0,"xtc_threshold":0.1,"repeat_last_n":64,"repeat_penalty":1,"presence_penalty":0,"frequency_penalty":0,"dry_multiplier":0,"dry_base":1.75,"dry_allowed_length":2,"dry_penalty_last_n":-1,"max_tokens":-1,"timings_per_token":false}
srv log_server_r: response:
srv remove_waiti: remove task 7 from waiting list. current waiting = 1 (before remove)
So it is the flash attention. This is probably because this head size (256) is only supported with F16. I'm not sure if that is because it is not commonly used, or because there is some performance issue that makes it unusable; @JohannesGaessler should know more. You should still be able to use K quantization with flash attention disabled.
Without the flash attention flag it does not load at all unfortunately. The command:
./bin/llama-server -m '/home/luis/Downloads/gemma-3-12b-it-Q4_K_M.gguf' --n-gpu-layers -1 --batch_size 1024 --cache-type-k q8_0 --cache-type-v q8_0 -c 8000 --port 7777 -t 8 -ngl 99 -v
the logs:
...
load: control token: 259067 '
Without flash attn you can only quantize K, but not V. You need to remove the --cache-type-v q8_0 option.
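For reference, a sketch of the adjusted command from above with the V-cache option dropped (same example paths and ports; --flash-attn stays off, so only the K cache is quantized):

```
./bin/llama-server -m '/home/luis/Downloads/gemma-3-12b-it-Q4_K_M.gguf' --n-gpu-layers -1 --batch_size 1024 --cache-type-k q8_0 -c 8000 --port 7777 -t 8 -ngl 99
```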
I have encountered the same problem; it seems the KV cache gets offloaded automatically even without the -nkvo option.
Other models with the same parameter count run significantly faster, and even models with a larger parameter count perform better than Gemma 3.
Related issue: GitHub Issue #9701
try -fa -ctk q4_0 -ctv q4_0
changes nothing, same output
I'm seeing the same on my older P6000.
Same case: while I can run other 12B models easily, Gemma 3 12B gets its cache offloaded, and leaving the V cache unquantized is not an option in low-VRAM situations. If I use --flash-attn -ctk q4_0 -ctv q4_0, the prompt processing is done on the CPU. If I use --flash-attn -ctk q4_0, the prompt processing is offloaded to the GPU, but the VRAM consumption skyrockets beyond the GPU's capacity and that is slow as molasses.
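For a rough sense of scale (a back-of-the-envelope estimate from the model metadata above, assuming every layer caches the full context and ignoring any savings from Gemma 3's sliding-window layers): with 48 layers, 8 KV heads and head size 256, each token stores 48 × 8 × 256 × 2 (K and V) = 196,608 cache elements, i.e. about 384 KiB at f16 or about 204 KiB at q8_0 (34 bytes per 32-element block). Over an 8000-token context that is roughly 2.9 GiB at f16 versus about 1.6 GiB fully quantized, and quantizing only K recovers just half of that saving, which is why the V cache matters so much on a 12 GB card.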
Changing topic a bit, I am really grateful for all the community effort on llama.cpp.
Inference is twice as slow when using the q8_0 cache:
16 tokens/sec unquantized vs 8 tokens/sec with the q8_0 KV cache.
With Mistral Nemo it is only ~5% slower.
Bump! A very sad bug: you get a fat context, but quantizing it kills inference speed. =(
> So it is the flash attention. This is probably because this head size (256) is only supported with F16. I'm not sure if that is because it is not commonly used, or because there is some performance issue that makes it unusable; @JohannesGaessler should know more. You should still be able to use K quantization with flash attention disabled.
The problem is register pressure. Head size 256 needs more registers than head size 128 and a quantized KV cache also needs more registers than an FP16 KV cache. If you combine the two the current kernel simply runs out of registers and the performance is effectively unusable which is why the CUDA backend does not support it. The code would need to be specifically rewritten for that use case to make it usable.
Thx for your answer!
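If anyone wants to see that register pressure for themselves, a rough sketch (assuming a local source build with CUDA; GGML_CUDA, CMAKE_CUDA_ARCHITECTURES and CMAKE_CUDA_FLAGS are the standard llama.cpp/CMake options, and the exact kernel names in the output will differ by build):

```
# configure the CUDA backend and ask ptxas to report per-kernel resource usage
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DCMAKE_CUDA_FLAGS="-Xptxas -v"
cmake --build build --config Release -j
# ptxas prints a "Used N registers ..." line for every compiled kernel variant,
# including the flash-attention kernels for each head size / KV-cache type
```

Head size 256 combined with a quantized KV cache pushes those counts toward the hardware ceiling of 255 registers per thread, which is what the explanation above refers to.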
This is still very much relevant.
This issue was closed because it has been inactive for 14 days since being marked as stale.
This is still very relevant. The issue appears on M3 Pro with 36GB RAM running gemma3:12b.
@Mondonno I don't observe a slowdown on my M4 Max. Post your numbers.
llama-bench -m gemma-3-12b-it-q8_0.gguf -fa 1 -p 512,2048 -ub 2048 -ctk q8_0 -ctv q8_0
| model | size | params | backend | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|
| gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | Metal | f16 | f16 | 1 | pp512 | 447.00 ± 5.62 |
| gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | Metal | f16 | f16 | 1 | pp2048 | 362.49 ± 4.93 |
| gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | Metal | f16 | f16 | 1 | tg128 | 25.50 ± 0.04 |
| gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | Metal | q8_0 | q8_0 | 1 | pp512 | 449.50 ± 4.99 |
| gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | Metal | q8_0 | q8_0 | 1 | pp2048 | 366.29 ± 5.02 |
| gemma3 12B Q8_0 | 11.64 GiB | 11.77 B | Metal | q8_0 | q8_0 | 1 | tg128 | 25.34 ± 0.04 |
build: 55042b369 (6308)
Yep, I'm still seeing this issue. Quantized KV cache on all my Gemma 3 models is significantly slower than on other models with similar settings (e.g. Qwen3's KV cache population speed is similar to what I see for Gemma 3 with an unquantized KV cache).
q8_0 vs f16 for Gemma3 on RTX 3060 below on b6432.
PS C:\ai\programs> .\llama-b6432-bin-win-cuda-12.4-x64\llama-bench -m ..\models\unsloth\gemma-3\gemma-3-4b-it-UD-Q6_K_XL.gguf -fa 1 -p 512,2048 -ub 2048 -ctk q8_0,f16 -ctv q8_0,f16
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
load_backend: loaded CUDA backend from C:\ai\programs\llama-b6432-bin-win-cuda-12.4-x64\ggml-cuda.dll
load_backend: loaded RPC backend from C:\ai\programs\llama-b6432-bin-win-cuda-12.4-x64\ggml-rpc.dll
load_backend: loaded CPU backend from C:\ai\programs\llama-b6432-bin-win-cuda-12.4-x64\ggml-cpu-haswell.dll
| model | size | params | backend | ngl | n_ubatch | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | CUDA,RPC | 99 | 2048 | q8_0 | q8_0 | 1 | pp512 | 549.49 ± 9.91 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | CUDA,RPC | 99 | 2048 | q8_0 | q8_0 | 1 | pp2048 | 196.77 ± 3.91 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | CUDA,RPC | 99 | 2048 | q8_0 | q8_0 | 1 | tg128 | 50.56 ± 0.40 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | CUDA,RPC | 99 | 2048 | f16 | f16 | 1 | pp512 | 3593.71 ± 48.88 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | CUDA,RPC | 99 | 2048 | f16 | f16 | 1 | pp2048 | 3732.72 ± 21.12 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | CUDA,RPC | 99 | 2048 | f16 | f16 | 1 | tg128 | 71.20 ± 0.18 |
The issue is not specific to this model, it happens with gpt-oss-20b and most other models too. It doesn't happen with the Vulkan backend so it's something in the CUDA backend.
Notice the extreme degradation in pp performance from pp512 to pp2048 with the q8_0 KV cache. Until recently (b6332 is the last set of benchmarks I have), that also used to happen at f16. Sometime after b6332 there was a massive improvement to f16 flash attention in ROCm/CUDA, but it looks like that was only done for f16, because the q8_0 KV cache is still just as bad as before.
With the current build Vulkan is actually much faster than ROCm/CUDA for q8_0 kv cache.
ROCm 7.0.1 (same results with ROCm 6.4.3)
$ llama-bench -fa 1 -p 512,2048 -ctk q8_0,f16 -ctv q8_0,f16 -m gemma-3-4b-it-UD-Q6_K_XL.gguf -m gpt-oss-20b-mxfp4.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon RX 6800, gfx1030 (0x1030), VMM: no, Wave Size: 32
| model | size | params | backend | ngl | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | ROCm | 99 | q8_0 | q8_0 | 1 | pp512 | 929.66 ± 6.15 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | ROCm | 99 | q8_0 | q8_0 | 1 | pp2048 | 386.06 ± 5.02 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | ROCm | 99 | q8_0 | q8_0 | 1 | tg128 | 65.59 ± 0.10 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | q8_0 | q8_0 | 1 | pp512 | 634.25 ± 14.74 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | q8_0 | q8_0 | 1 | pp2048 | 225.06 ± 0.83 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | q8_0 | q8_0 | 1 | tg128 | 76.80 ± 0.56 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | ROCm | 99 | f16 | f16 | 1 | pp512 | 2250.42 ± 0.39 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | ROCm | 99 | f16 | f16 | 1 | pp2048 | 2171.00 ± 0.61 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | ROCm | 99 | f16 | f16 | 1 | tg128 | 81.28 ± 0.01 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | f16 | f16 | 1 | pp512 | 2261.13 ± 6.80 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | f16 | f16 | 1 | pp2048 | 2157.78 ± 8.26 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | ROCm | 99 | f16 | f16 | 1 | tg128 | 98.18 ± 0.05 |
build: f432d8d8 (6521)
Vulkan (RADV 25.1.9)
$ llama-bench -fa 1 -p 512,2048 -ctk q8_0,f16 -ctv q8_0,f16 -m gemma-3-4b-it-UD-Q6_K_XL.gguf -m gpt-oss-20b-mxfp4.gguf
load_backend: loaded RPC backend from /home/xxx/.local/llama-cpp/bin/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 6800 (RADV NAVI21) (radv) | uma: 0 | fp16: 1 | bf16: 0 | warp size: 32 | shared memory: 65536 | int dot: 1 | matrix cores: none
load_backend: loaded Vulkan backend from /home/xxx/.local/llama-cpp/bin/libggml-vulkan.so
load_backend: loaded CPU backend from /home/xxx/.local/llama-cpp/bin/libggml-cpu-haswell.so
| model | size | params | backend | ngl | type_k | type_v | fa | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | RPC,Vulkan | 99 | q8_0 | q8_0 | 1 | pp512 | 1439.45 ± 0.69 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | RPC,Vulkan | 99 | q8_0 | q8_0 | 1 | pp2048 | 1330.21 ± 0.72 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | RPC,Vulkan | 99 | q8_0 | q8_0 | 1 | tg128 | 102.89 ± 0.03 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | q8_0 | q8_0 | 1 | pp512 | 1140.77 ± 12.97 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | q8_0 | q8_0 | 1 | pp2048 | 1102.88 ± 1.42 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | q8_0 | q8_0 | 1 | tg128 | 129.65 ± 0.20 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | RPC,Vulkan | 99 | f16 | f16 | 1 | pp512 | 1404.21 ± 0.92 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | RPC,Vulkan | 99 | f16 | f16 | 1 | pp2048 | 1234.45 ± 1.36 |
| gemma3 4B Q6_K | 3.32 GiB | 3.88 B | RPC,Vulkan | 99 | f16 | f16 | 1 | tg128 | 103.14 ± 0.08 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | f16 | f16 | 1 | pp512 | 1152.62 ± 6.63 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | f16 | f16 | 1 | pp2048 | 1116.14 ± 5.23 |
| gpt-oss 20B MXFP4 MoE | 11.27 GiB | 20.91 B | RPC,Vulkan | 99 | f16 | f16 | 1 | tg128 | 130.13 ± 0.15 |
build: f432d8d8 (6521)
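For anyone who wants to reproduce the Vulkan comparison above from source, a rough build sketch (assuming a git checkout and the Vulkan SDK with glslc installed; GGML_VULKAN is the current llama.cpp CMake option and the model file is the same one used in the benchmarks above):

```
cmake -B build-vulkan -DGGML_VULKAN=ON
cmake --build build-vulkan --config Release -j
./build-vulkan/bin/llama-bench -m gemma-3-4b-it-UD-Q6_K_XL.gguf -fa 1 -p 512,2048 -ctk q8_0,f16 -ctv q8_0,f16
```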