
Token loss for llama.tokenize() with mixed Chinese/English text

Open · cwleungar opened this issue 1 month ago · 5 comments

When calling llama.tokenize() from llama_cpp_dart on a mixed Chinese/English string, the returned token count is significantly smaller than the token count produced by llama-cpp-python using the same GGUF model and text.

For the test case below, Dart returns 48 tokens while Python returns 71 tokens. The Python output matches llama.cpp’s behavior, so it looks like the Dart binding is losing tokens somewhere.

Environment

Platform: Android

Architecture: arm64-v8a

llama_cpp_dart version: ^0.1.2+1

Model: bge-m3-q4_k_m.gguf

Dart code (llama_cpp_dart)

final contextParams = ContextParams()..nCtx = 2048;

final llama = Llama(
  modelPath,
  ModelParams(),
  contextParams,
  SamplerParams(),
  true,
);

const text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week.""";

final tokens = llama.tokenize(
  text,
  true, // addBos
);

print("token count (Dart): ${tokens.length}");
print(tokens);

Observed result (Dart)

token count (Dart): 48
tokens: [0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 3, 2]

The tail of the string is not actually tokenized: the Dart sequence matches the Python one up to token 789, then ends with token 3, which is the model's unknown-token id (tokenizer.ggml.unknown_token_id = 3 in the GGUF metadata), followed by EOS. The rest of the English sentence appears to collapse into a single UNK token.

Python code (llama-cpp-python, same GGUF)

from llama_cpp import Llama

model = Llama(
    model_path="path/to/same/model.gguf",
    embedding=True,
)

text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week."""

token_ids = model.tokenize(text.encode("utf-8"))

print("token count (Python):", len(token_ids))
print(token_ids)

Observed result (Python)

token count (Python): 71
tokens: [0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 6906, 79423, 30, 10660, 5154, 442, 831, 5646, 1257, 47, 17262, 40859, 4, 1284, 70, 4393, 132954, 3638, 538, 32497, 28032, 10, 5895, 5, 2]

When detokenized, Python returns the full original string.
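For reference, a minimal round-trip check, written as a hypothetical continuation of the Python snippet above (detokenize returns UTF-8 bytes; the UGM tokenizer may normalize leading whitespace):

# hypothetical continuation of the script above
detok = model.detokenize(token_ids).decode("utf-8", errors="replace")
print(detok.strip() == text)  # expected True: nothing is lost in Python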

Expected behavior

llama.tokenize() in Dart should return the same token sequence as llama-cpp-python (and llama.cpp) for the same GGUF model and UTF-8 input text, i.e. around 71 tokens for this test case, with no loss of the tail of the string.

Additional notes

Changing nCtx does not affect the Dart token count.

The discrepancy appears only in Dart; Python behaves as expected with the same model and text.

This is user-facing in embedding / RAG scenarios where correct token counts are important.
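To illustrate the RAG case, a hypothetical budget check that a pipeline might build on the tokenize() API above (the function name and budget are made up):

// Hypothetical: decide whether a chunk fits an embedding token budget.
// An undercounting tokenize() silently over-packs chunks here.
bool fitsTokenBudget(Llama llama, String chunk, {int maxTokens = 512}) {
  return llama.tokenize(chunk, true).length <= maxTokens;
}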

cwleungar · Nov 24 '25 09:11

@cwleungar thank you so much for the report. I have fixed the issue with the 1.2 push; there are some breaking changes, so be careful.

netdur · Nov 24 '25 18:11

Thank you for the quick fix. However, after I updated to use the GitHub repo at commit 17978ee (instead of the version on pub.dev), I now get:

Llama init failed: LlamaException: Could not load model at /data/user/0/app name/app_flutter/Embedding.gguf
I/flutter (32173): #0 Llama._initializeLlama (package:llama_cpp_dart/src/llama.dart:232:9)

Are there any breaking changes in this commit compared to the pub.dev version that affect how the model path is resolved or how embedding GGUF models are loaded? Do I need to change anything in my configuration or model file (e.g., path format, file placement under app_flutter, or GGUF version) to load Embedding.gguf successfully?
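For context, a path like the one in the error is typically produced in Flutter as follows; this is a hypothetical sketch assuming package:path_provider, not the reporter's actual code:

import 'package:path_provider/path_provider.dart';

// getApplicationDocumentsDirectory() resolves to
// /data/user/0/<package>/app_flutter on Android.
final dir = await getApplicationDocumentsDirectory();
final modelPath = '${dir.path}/Embedding.gguf';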

cwleungar · Nov 25 '25 09:11

Here is the code, the exact llama.cpp commit I built, and the run output. Can you try to replicate?

(base) adel@adels-MacBook-Pro llama_cpp_dart % git submodule status src/llama.cpp
b8595b16e69e3029e06be3b8f6635f9812b2bc3f src/llama.cpp (gguf-v0.17.1-1293-gb8595b16e)
(base) adel@adels-MacBook-Pro llama_cpp_dart % more neo/bug.tokens.dart
// ignore_for_file: avoid_print

import 'dart:io';
import 'dart:async';

import 'package:llama_cpp_dart/llama_cpp_dart.dart';

void main() async {

  // Library path setup
  Llama.libraryPath = "/Users/adel/Workspace/llama_cpp_dart/bin/MAC_ARM64/libmtmd.dylib";

  ContextParams contextParams = ContextParams();
  contextParams.embeddings = true;

final llama = Llama( "/Users/adel/Downloads/bge-m3-q4_k_m.gguf", modelParams: ModelParams(), contextParams: contextParams, samplerParams: SamplerParams(), verbose: true, );

const text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week.""";

  final tokens = llama.tokenize(
    text,
    true,
  );

print("token count (Dart): ${tokens.length}"); print(tokens); } (base) adel@adels-MacBook-Pro llama_cpp_dart % dart neo/bug.tokens.dart ggml_metal_device_init: tensor API disabled for pre-M5 device ggml_metal_library_init: using embedded metal library ggml_metal_library_init: loaded in 0.016 sec ggml_metal_device_init: GPU name: Apple M1 Max ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007) ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003) ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002) ggml_metal_device_init: simdgroup reduction = true ggml_metal_device_init: simdgroup matrix mul. = true ggml_metal_device_init: has unified memory = true ggml_metal_device_init: has bfloat = true ggml_metal_device_init: has tensor = false ggml_metal_device_init: use residency sets = true ggml_metal_device_init: use shared buffers = true ggml_metal_device_init: recommendedMaxWorkingSetSize = 26800.60 MB llama_model_load_from_file_impl: using device Metal (Apple M1 Max) (unknown id) - 25558 MiB free llama_model_loader: loaded meta data with 32 key-value pairs and 389 tensors from /Users/adel/Downloads/bge-m3-q4_k_m.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = bert llama_model_loader: - kv 1: general.type str = model llama_model_loader: - kv 2: general.size_label str = 567M llama_model_loader: - kv 3: general.license str = mit llama_model_loader: - kv 4: general.tags arr[str,4] = ["sentence-transformers", "feature-ex... llama_model_loader: - kv 5: bert.block_count u32 = 24 llama_model_loader: - kv 6: bert.context_length u32 = 8192 llama_model_loader: - kv 7: bert.embedding_length u32 = 1024 llama_model_loader: - kv 8: bert.feed_forward_length u32 = 4096 llama_model_loader: - kv 9: bert.attention.head_count u32 = 16 llama_model_loader: - kv 10: bert.attention.layer_norm_epsilon f32 = 0.000010 llama_model_loader: - kv 11: bert.attention.causal bool = false llama_model_loader: - kv 12: bert.pooling_type u32 = 2 llama_model_loader: - kv 13: tokenizer.ggml.model str = t5 llama_model_loader: - kv 14: tokenizer.ggml.pre str = default llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,250002] = ["", "", "", "", ","... llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,250002] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,250002] = [3, 3, 3, 2, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 18: tokenizer.ggml.add_space_prefix bool = true llama_model_loader: - kv 19: tokenizer.ggml.token_type_count u32 = 1 llama_model_loader: - kv 20: tokenizer.ggml.remove_extra_whitespaces bool = true llama_model_loader: - kv 21: tokenizer.ggml.precompiled_charsmap arr[u8,237539] = [0, 180, 2, 0, 0, 132, 0, 0, 0, 0, 0,... 
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 23: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 24: tokenizer.ggml.unknown_token_id u32 = 3
llama_model_loader: - kv 25: tokenizer.ggml.seperator_token_id u32 = 2
llama_model_loader: - kv 26: tokenizer.ggml.padding_token_id u32 = 1
llama_model_loader: - kv 27: tokenizer.ggml.mask_token_id u32 = 250001
llama_model_loader: - kv 28: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 29: tokenizer.ggml.add_eos_token bool = true
llama_model_loader: - kv 30: general.quantization_version u32 = 2
llama_model_loader: - kv 31: general.file_type u32 = 15
llama_model_loader: - type f32: 244 tensors
llama_model_loader: - type q4_K: 120 tensors
llama_model_loader: - type q6_K: 25 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 410.97 MiB (6.08 BPW)
init_tokenizer: initializing tokenizer for type 4
load: model vocab missing newline token, using special_pad_id instead
load: control token: 0 '' is not marked as EOG
load: control token: 2 '' is not marked as EOG
load: control token: 1 '' is not marked as EOG
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: printing all EOG tokens:
load: - 2 ('')
load: special tokens cache size = 4
load: token to piece cache size = 2.1668 MB
print_info: arch = bert
print_info: vocab_only = 0
print_info: n_ctx_train = 8192
print_info: n_embd = 1024
print_info: n_embd_inp = 1024
print_info: n_layer = 24
print_info: n_head = 16
print_info: n_head_kv = 16
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 1.0e-05
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 4096
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: n_expert_groups = 0
print_info: n_group_used = 0
print_info: causal attn = 0
print_info: pooling type = 2
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 8192
print_info: rope_finetuned = unknown
print_info: model type = 335M
print_info: model params = 566.70 M
print_info: general.name = n/a
print_info: vocab type = UGM
print_info: n_vocab = 250002
print_info: n_merges = 0
print_info: BOS token = 0 ''
print_info: EOS token = 2 ''
print_info: UNK token = 3 ''
print_info: SEP token = 2 ''
print_info: PAD token = 1 ''
print_info: MASK token = 250001 '[PAD250000]'
print_info: LF token = 0 ''
print_info: EOG token = 2 ''
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device Metal, is_swa = 0
[... same line repeated for layers 1 through 24 ...]
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor token_types.weight
create_tensor: loading tensor position_embd.weight
create_tensor: loading tensor token_embd_norm.weight
create_tensor: loading tensor token_embd_norm.bias
[... create_tensor lines repeated for every weight and bias of blk.0 through blk.23 ...]
ggml_metal_log_allocated_size: allocated buffer, size = 178.70 MiB, ( 179.08 / 25559.05)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 232.28 MiB
load_tensors: Metal_Mapped model buffer size = 178.69 MiB
..............................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 512
llama_context: n_ctx_seq = 512
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 0
llama_context: flash_attn = disabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (512) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: use fusion = true
ggml_metal_init: use concurrency = true
ggml_metal_init: use graph optimize = true
set_abort_callback: call
llama_context: CPU output buffer size = 0.96 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 3112
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
llama_context: Metal compute buffer size = 27.00 MiB
llama_context: CPU compute buffer size = 5.01 MiB
llama_context: graph nodes = 851
llama_context: graph splits = 2
token count (Dart): 72
[0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 6906, 79423, 30, 3957, 53, 5154, 442, 831, 5646, 1257, 47, 17262, 40859, 4, 1284, 70, 4393, 132954, 3638, 538, 32497, 28032, 10, 5895, 5, 2]
(base) adel@adels-MacBook-Pro llama_cpp_dart %

netdur · Nov 25 '25 18:11

Hi, thank you again for your help. I tested the code on macOS ARM64, Android, and Windows x86. On macOS it works as expected and matches your output. However, on Android it still throws a null pointer when using the GitHub version, and on Windows it shows:

llama_model_load_from_file_impl: no backends are loaded. hint: use ggml_backend_load() or ggml_backend_load_all() to load a backend before calling this function

which then results in the same LlamaException as on Android:

LlamaException: Failed to initialize Llama (LlamaException: Could not load model at bge-m3-q4_k_m.gguf).

I tested the pub.dev 0.1.2+1 version with the updated tokenize code, and that works fine, so I will use this for now. I hope this information helps you continue developing the library. If you would like more details, I am happy to help with further testing.

cwleungar · Nov 26 '25 09:11

Yes, I understand the issue. On Android, the llama.cpp build likely does not have GPU support, so you need to set modelParams.mainGpu = -1; to force CPU only. I guess Windows has the same issue.
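A minimal sketch of the suggested workaround, assuming the constructor shape from the repro earlier in this thread (the model path is illustrative):

final modelParams = ModelParams();
modelParams.mainGpu = -1; // force CPU-only, per the suggestion above

final llama = Llama(
  "path/to/bge-m3-q4_k_m.gguf", // illustrative path
  modelParams: modelParams,
  contextParams: ContextParams()..embeddings = true,
  samplerParams: SamplerParams(),
);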

netdur · Nov 26 '25 22:11