Token loss for llama.tokenize() with mixed Chinese/English text
When calling llama.tokenize() from llama_cpp_dart on a mixed Chinese/English string, the returned token count is significantly smaller than the token count produced by llama-cpp-python using the same GGUF model and text.
For the test case below, Dart returns 48 tokens while Python returns 71 tokens. The Python output matches llama.cpp’s behavior, so it looks like the Dart binding is losing tokens somewhere.
Environment

Platform: Android
Architecture: arm64-v8a
llama_cpp_dart version: ^0.1.2+1
Model: bge-m3-q4_k_m.gguf
Dart code (llama_cpp_dart)
final contextParams = ContextParams()..nCtx = 2048;
final llama = Llama(
  modelPath,
  ModelParams(),
  contextParams,
  SamplerParams(),
  true,
);
const text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week.""";
final tokens = llama.tokenize(
  text,
  true, // addBos
);
print("token count (Dart): ${tokens.length}");
print(tokens);
Observed result (Dart)

token count (Dart): 48
tokens: [0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 3, 2]
The last part of the English sentence does not appear to be fully tokenized.
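To pinpoint where the two outputs diverge, here is a small plain-Dart comparison of the IDs above against the Python reference IDs from the run further below. This uses no llama_cpp_dart API, just the two lists:

// Compare the token IDs returned by llama_cpp_dart (48) with the reference
// IDs from llama-cpp-python (71) and report the first index where they differ.
void main() {
  const dartTokens = [
    0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922,
    13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15,
    683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15,
    14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 3, 2
  ];
  const pythonTokens = [
    0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922,
    13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15,
    683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15,
    14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 6906, 79423,
    30, 10660, 5154, 442, 831, 5646, 1257, 47, 17262, 40859, 4, 1284, 70,
    4393, 132954, 3638, 538, 32497, 28032, 10, 5895, 5, 2
  ];

  var i = 0;
  while (i < dartTokens.length &&
      i < pythonTokens.length &&
      dartTokens[i] == pythonTokens[i]) {
    i++;
  }
  final dartNext = i < dartTokens.length ? dartTokens[i].toString() : '(end)';
  final pythonNext = i < pythonTokens.length ? pythonTokens[i].toString() : '(end)';
  print('first divergence at index $i: Dart=$dartNext, Python=$pythonNext');
  print('tokens missing from the Dart output: '
      '${pythonTokens.length - dartTokens.length}');
}

Running this shows the sequences agree up to the tail of the Chinese text and diverge right where the English sentence begins.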
Python code (llama-cpp-python, same GGUF)
from llama_cpp import Llama
model = Llama(
    model_path="path/to/same/model.gguf",
    embedding=True,
)
text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week."""
token_ids = model.tokenize(text.encode("utf-8"))
print("token count (Python):", len(token_ids))
print(token_ids)
Observed result (Python)
token count (Python): 71
tokens: [0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 6906, 79423, 30, 10660, 5154, 442, 831, 5646, 1257, 47, 17262, 40859, 4, 1284, 70, 4393, 132954, 3638, 538, 32497, 28032, 10, 5895, 5, 2]
When detokenized, Python returns the full original string.
Expected behavior

llama.tokenize() in Dart should return the same token sequence length as llama-cpp-python (and llama.cpp) for the same GGUF model and UTF-8 input text, i.e. around 71 tokens for this test case, with no loss of the tail of the string.
Additional notes

Changing nCtx does not affect the Dart token count.
The discrepancy appears only in Dart; Python behaves as expected with the same model and text.
This is user-facing in embedding / RAG scenarios where correct token counts are important.
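As an illustration of the impact, here is a minimal sketch (plain Dart; fitsInBudget is a hypothetical chunking helper, not part of llama_cpp_dart) of how an undercount can let an over-long chunk slip past a token-budget check:

// Hypothetical RAG chunk-sizing check: a chunk is accepted when its reported
// token count fits the embedding context/batch budget. An undercounting
// tokenizer makes an over-long chunk look safe, so its tail is never embedded.
bool fitsInBudget(int reportedTokenCount, int tokenBudget) =>
    reportedTokenCount <= tokenBudget;

void main() {
  const tokenBudget = 64;   // example per-chunk budget
  const dartCount = 48;     // count reported by llama.tokenize() in Dart
  const actualCount = 71;   // count the same text actually needs (llama.cpp)

  print('accepted by chunker: ${fitsInBudget(dartCount, tokenBudget)}');   // true
  print('actually fits:       ${fitsInBudget(actualCount, tokenBudget)}'); // false
}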
@cwleungar thank you so much for the report, I have fixed the issue with the 1.2 push; there are some breaking changes, so be careful.
Thank you for the quick fix. However, after I updated to use the GitHub repo at commit 17978ee (instead of the version on pub.dev), I now get:
Llama init failed: LlamaException: Could not load model at /data/user/0/app name/app_flutter/Embedding.gguf
I/flutter (32173): #0 Llama._initializeLlama (package:llama_cpp_dart/src/llama.dart:232:9)
Are there any breaking changes in this commit compared to the pub.dev version that affect how the model path is resolved or how embedding GGUF models are loaded? Do I need to change anything in my configuration or model file (e.g., path format, file placement under app_flutter, or GGUF version) to load Embedding.gguf successfully?
Here is the code and the exact llama.cpp commit I built against, plus the run and its output. Can you try to replicate?
(base) adel@adels-MacBook-Pro llama_cpp_dart % git submodule status src/llama.cpp
 b8595b16e69e3029e06be3b8f6635f9812b2bc3f src/llama.cpp (gguf-v0.17.1-1293-gb8595b16e)
(base) adel@adels-MacBook-Pro llama_cpp_dart % more neo/bug.tokens.dart
// ignore_for_file: avoid_print
import 'dart:io';
import 'dart:async';
import 'package:llama_cpp_dart/llama_cpp_dart.dart';
void main() async {
// Library path setup
Llama.libraryPath = "/Users/adel/Workspace/llama_cpp_dart/bin/MAC_ARM64/libmtmd.dylib";
ContextParams contextParams = ContextParams();
contextParams.embeddings = true;
final llama = Llama(
  "/Users/adel/Downloads/bge-m3-q4_k_m.gguf",
  modelParams: ModelParams(),
  contextParams: contextParams,
  samplerParams: SamplerParams(),
  verbose: true,
);
const text = """一旦您通過了筆試和路試,考官將拿走你的暫准駕駛執照 (P牌) (綠色)。你的正式駕駛執照 (粉紅色) 將通過郵寄寄到你的家中。They say it can take up to three weeks, but the full licence normally comes within a week.""";
final tokens = llama.tokenize(text, true);
print("token count (Dart): ${tokens.length}");
print(tokens);
}
(base) adel@adels-MacBook-Pro llama_cpp_dart % dart neo/bug.tokens.dart
ggml_metal_device_init: tensor API disabled for pre-M5 device
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.016 sec
ggml_metal_device_init: GPU name: Apple M1 Max
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: has tensor = false
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 26800.60 MB
llama_model_load_from_file_impl: using device Metal (Apple M1 Max) (unknown id) - 25558 MiB free
llama_model_loader: loaded meta data with 32 key-value pairs and 389 tensors from /Users/adel/Downloads/bge-m3-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = bert
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.size_label str = 567M
llama_model_loader: - kv 3: general.license str = mit
llama_model_loader: - kv 4: general.tags arr[str,4] = ["sentence-transformers", "feature-ex...
llama_model_loader: - kv 5: bert.block_count u32 = 24
llama_model_loader: - kv 6: bert.context_length u32 = 8192
llama_model_loader: - kv 7: bert.embedding_length u32 = 1024
llama_model_loader: - kv 8: bert.feed_forward_length u32 = 4096
llama_model_loader: - kv 9: bert.attention.head_count u32 = 16
llama_model_loader: - kv 10: bert.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 11: bert.attention.causal bool = false
llama_model_loader: - kv 12: bert.pooling_type u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.model str = t5
llama_model_loader: - kv 14: tokenizer.ggml.pre str = default
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,250002] = ["<s>", "<pad>", "</s>", "<unk>", ...
load: control token: 2 '</s>' is not marked as EOG
load: control token: 1 '<pad>'
print_info: EOS token = 2 '</s>'
print_info: UNK token = 3 '<unk>'
print_info: EOG token = 2 '</s>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer 0 assigned to device Metal, is_swa = 0
load_tensors: layer 1 assigned to device Metal, is_swa = 0
load_tensors: layer 2 assigned to device Metal, is_swa = 0
load_tensors: layer 3 assigned to device Metal, is_swa = 0
load_tensors: layer 4 assigned to device Metal, is_swa = 0
load_tensors: layer 5 assigned to device Metal, is_swa = 0
load_tensors: layer 6 assigned to device Metal, is_swa = 0
load_tensors: layer 7 assigned to device Metal, is_swa = 0
load_tensors: layer 8 assigned to device Metal, is_swa = 0
load_tensors: layer 9 assigned to device Metal, is_swa = 0
load_tensors: layer 10 assigned to device Metal, is_swa = 0
load_tensors: layer 11 assigned to device Metal, is_swa = 0
load_tensors: layer 12 assigned to device Metal, is_swa = 0
load_tensors: layer 13 assigned to device Metal, is_swa = 0
load_tensors: layer 14 assigned to device Metal, is_swa = 0
load_tensors: layer 15 assigned to device Metal, is_swa = 0
load_tensors: layer 16 assigned to device Metal, is_swa = 0
load_tensors: layer 17 assigned to device Metal, is_swa = 0
load_tensors: layer 18 assigned to device Metal, is_swa = 0
load_tensors: layer 19 assigned to device Metal, is_swa = 0
load_tensors: layer 20 assigned to device Metal, is_swa = 0
load_tensors: layer 21 assigned to device Metal, is_swa = 0
load_tensors: layer 22 assigned to device Metal, is_swa = 0
load_tensors: layer 23 assigned to device Metal, is_swa = 0
load_tensors: layer 24 assigned to device Metal, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor token_types.weight
create_tensor: loading tensor position_embd.weight
create_tensor: loading tensor token_embd_norm.weight
create_tensor: loading tensor token_embd_norm.bias
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_q.bias
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_k.bias
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_v.bias
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_output.bias
create_tensor: loading tensor blk.0.attn_output_norm.weight
create_tensor: loading tensor blk.0.attn_output_norm.bias
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.0.ffn_up.bias
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_down.bias
create_tensor: loading tensor blk.0.layer_output_norm.weight
create_tensor: loading tensor blk.0.layer_output_norm.bias
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_q.bias
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_k.bias
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_v.bias
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_output.bias
create_tensor: loading tensor blk.1.attn_output_norm.weight
create_tensor: loading tensor blk.1.attn_output_norm.bias
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.1.ffn_up.bias
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.ffn_down.bias
create_tensor: loading tensor blk.1.layer_output_norm.weight
create_tensor: loading tensor blk.1.layer_output_norm.bias
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_q.bias
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_k.bias
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_v.bias
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_output.bias
create_tensor: loading tensor blk.2.attn_output_norm.weight
create_tensor: loading tensor blk.2.attn_output_norm.bias
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.2.ffn_up.bias
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.ffn_down.bias
create_tensor: loading tensor blk.2.layer_output_norm.weight
create_tensor: loading tensor blk.2.layer_output_norm.bias
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_q.bias
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_k.bias
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_v.bias
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_output.bias
create_tensor: loading tensor blk.3.attn_output_norm.weight
create_tensor: loading tensor blk.3.attn_output_norm.bias
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.3.ffn_up.bias
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.ffn_down.bias
create_tensor: loading tensor blk.3.layer_output_norm.weight
create_tensor: loading tensor blk.3.layer_output_norm.bias
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_q.bias
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_k.bias
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_v.bias
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_output.bias
create_tensor: loading tensor blk.4.attn_output_norm.weight
create_tensor: loading tensor blk.4.attn_output_norm.bias
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.4.ffn_up.bias
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.ffn_down.bias
create_tensor: loading tensor blk.4.layer_output_norm.weight
create_tensor: loading tensor blk.4.layer_output_norm.bias
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_q.bias
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_k.bias
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_v.bias
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_output.bias
create_tensor: loading tensor blk.5.attn_output_norm.weight
create_tensor: loading tensor blk.5.attn_output_norm.bias
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.5.ffn_up.bias
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.ffn_down.bias
create_tensor: loading tensor blk.5.layer_output_norm.weight
create_tensor: loading tensor blk.5.layer_output_norm.bias
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_q.bias
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_k.bias
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_v.bias
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_output.bias
create_tensor: loading tensor blk.6.attn_output_norm.weight
create_tensor: loading tensor blk.6.attn_output_norm.bias
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.6.ffn_up.bias
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.ffn_down.bias
create_tensor: loading tensor blk.6.layer_output_norm.weight
create_tensor: loading tensor blk.6.layer_output_norm.bias
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_q.bias
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_k.bias
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_v.bias
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_output.bias
create_tensor: loading tensor blk.7.attn_output_norm.weight
create_tensor: loading tensor blk.7.attn_output_norm.bias
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.7.ffn_up.bias
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.ffn_down.bias
create_tensor: loading tensor blk.7.layer_output_norm.weight
create_tensor: loading tensor blk.7.layer_output_norm.bias
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_q.bias
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_k.bias
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_v.bias
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_output.bias
create_tensor: loading tensor blk.8.attn_output_norm.weight
create_tensor: loading tensor blk.8.attn_output_norm.bias
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.8.ffn_up.bias
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.ffn_down.bias
create_tensor: loading tensor blk.8.layer_output_norm.weight
create_tensor: loading tensor blk.8.layer_output_norm.bias
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_q.bias
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_k.bias
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_v.bias
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_output.bias
create_tensor: loading tensor blk.9.attn_output_norm.weight
create_tensor: loading tensor blk.9.attn_output_norm.bias
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.9.ffn_up.bias
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.ffn_down.bias
create_tensor: loading tensor blk.9.layer_output_norm.weight
create_tensor: loading tensor blk.9.layer_output_norm.bias
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_q.bias
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_k.bias
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_v.bias
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_output.bias
create_tensor: loading tensor blk.10.attn_output_norm.weight
create_tensor: loading tensor blk.10.attn_output_norm.bias
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.10.ffn_up.bias
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.ffn_down.bias
create_tensor: loading tensor blk.10.layer_output_norm.weight
create_tensor: loading tensor blk.10.layer_output_norm.bias
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_q.bias
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_k.bias
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_v.bias
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_output.bias
create_tensor: loading tensor blk.11.attn_output_norm.weight
create_tensor: loading tensor blk.11.attn_output_norm.bias
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.11.ffn_up.bias
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.ffn_down.bias
create_tensor: loading tensor blk.11.layer_output_norm.weight
create_tensor: loading tensor blk.11.layer_output_norm.bias
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_q.bias
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_k.bias
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_v.bias
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_output.bias
create_tensor: loading tensor blk.12.attn_output_norm.weight
create_tensor: loading tensor blk.12.attn_output_norm.bias
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.12.ffn_up.bias
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.ffn_down.bias
create_tensor: loading tensor blk.12.layer_output_norm.weight
create_tensor: loading tensor blk.12.layer_output_norm.bias
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_q.bias
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_k.bias
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_v.bias
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_output.bias
create_tensor: loading tensor blk.13.attn_output_norm.weight
create_tensor: loading tensor blk.13.attn_output_norm.bias
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.13.ffn_up.bias
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.ffn_down.bias
create_tensor: loading tensor blk.13.layer_output_norm.weight
create_tensor: loading tensor blk.13.layer_output_norm.bias
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_q.bias
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_k.bias
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_v.bias
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_output.bias
create_tensor: loading tensor blk.14.attn_output_norm.weight
create_tensor: loading tensor blk.14.attn_output_norm.bias
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.14.ffn_up.bias
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.ffn_down.bias
create_tensor: loading tensor blk.14.layer_output_norm.weight
create_tensor: loading tensor blk.14.layer_output_norm.bias
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_q.bias
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_k.bias
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_v.bias
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_output.bias
create_tensor: loading tensor blk.15.attn_output_norm.weight
create_tensor: loading tensor blk.15.attn_output_norm.bias
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.15.ffn_up.bias
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.ffn_down.bias
create_tensor: loading tensor blk.15.layer_output_norm.weight
create_tensor: loading tensor blk.15.layer_output_norm.bias
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_q.bias
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_k.bias
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_v.bias
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_output.bias
create_tensor: loading tensor blk.16.attn_output_norm.weight
create_tensor: loading tensor blk.16.attn_output_norm.bias
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.16.ffn_up.bias
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.ffn_down.bias
create_tensor: loading tensor blk.16.layer_output_norm.weight
create_tensor: loading tensor blk.16.layer_output_norm.bias
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_q.bias
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_k.bias
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_v.bias
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_output.bias
create_tensor: loading tensor blk.17.attn_output_norm.weight
create_tensor: loading tensor blk.17.attn_output_norm.bias
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.17.ffn_up.bias
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.ffn_down.bias
create_tensor: loading tensor blk.17.layer_output_norm.weight
create_tensor: loading tensor blk.17.layer_output_norm.bias
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_q.bias
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_k.bias
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_v.bias
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_output.bias
create_tensor: loading tensor blk.18.attn_output_norm.weight
create_tensor: loading tensor blk.18.attn_output_norm.bias
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.18.ffn_up.bias
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.ffn_down.bias
create_tensor: loading tensor blk.18.layer_output_norm.weight
create_tensor: loading tensor blk.18.layer_output_norm.bias
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_q.bias
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_k.bias
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_v.bias
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_output.bias
create_tensor: loading tensor blk.19.attn_output_norm.weight
create_tensor: loading tensor blk.19.attn_output_norm.bias
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.19.ffn_up.bias
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.ffn_down.bias
create_tensor: loading tensor blk.19.layer_output_norm.weight
create_tensor: loading tensor blk.19.layer_output_norm.bias
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_q.bias
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_k.bias
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_v.bias
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_output.bias
create_tensor: loading tensor blk.20.attn_output_norm.weight
create_tensor: loading tensor blk.20.attn_output_norm.bias
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.20.ffn_up.bias
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.ffn_down.bias
create_tensor: loading tensor blk.20.layer_output_norm.weight
create_tensor: loading tensor blk.20.layer_output_norm.bias
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_q.bias
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_k.bias
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_v.bias
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_output.bias
create_tensor: loading tensor blk.21.attn_output_norm.weight
create_tensor: loading tensor blk.21.attn_output_norm.bias
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.21.ffn_up.bias
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.ffn_down.bias
create_tensor: loading tensor blk.21.layer_output_norm.weight
create_tensor: loading tensor blk.21.layer_output_norm.bias
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_q.bias
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_k.bias
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_v.bias
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_output.bias
create_tensor: loading tensor blk.22.attn_output_norm.weight
create_tensor: loading tensor blk.22.attn_output_norm.bias
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.22.ffn_up.bias
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.ffn_down.bias
create_tensor: loading tensor blk.22.layer_output_norm.weight
create_tensor: loading tensor blk.22.layer_output_norm.bias
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_q.bias
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_k.bias
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_v.bias
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_output.bias
create_tensor: loading tensor blk.23.attn_output_norm.weight
create_tensor: loading tensor blk.23.attn_output_norm.bias
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.23.ffn_up.bias
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.ffn_down.bias
create_tensor: loading tensor blk.23.layer_output_norm.weight
create_tensor: loading tensor blk.23.layer_output_norm.bias
ggml_metal_log_allocated_size: allocated buffer, size = 178.70 MiB, ( 179.08 / 25559.05)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors: CPU_Mapped model buffer size = 232.28 MiB
load_tensors: Metal_Mapped model buffer size = 178.69 MiB
..............................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 512
llama_context: n_ctx_seq = 512
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 0
llama_context: flash_attn = disabled
llama_context: kv_unified = false
llama_context: freq_base = 10000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (512) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Max
ggml_metal_init: picking default device: Apple M1 Max
ggml_metal_init: use fusion = true
ggml_metal_init: use concurrency = true
ggml_metal_init: use graph optimize = true
set_abort_callback: call
llama_context: CPU output buffer size = 0.96 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 3112
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
graph_reserve: reserving a graph for ubatch with n_tokens = 1, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens = 512, n_seqs = 1, n_outputs = 512
llama_context: Metal compute buffer size = 27.00 MiB
llama_context: CPU compute buffer size = 5.01 MiB
llama_context: graph nodes = 851
llama_context: graph splits = 2
token count (Dart): 72
[0, 6, 36247, 3479, 20057, 274, 23804, 12324, 264, 3136, 12324, 4, 15922, 13641, 2332, 11790, 3469, 6906, 125037, 27883, 123157, 47808, 8988, 15, 683, 13768, 16, 15, 95828, 14242, 6906, 12622, 123157, 47808, 8988, 15, 14210, 98196, 16, 6, 2332, 20057, 55047, 23636, 23636, 789, 6906, 79423, 30, 3957, 53, 5154, 442, 831, 5646, 1257, 47, 17262, 40859, 4, 1284, 70, 4393, 132954, 3638, 538, 32497, 28032, 10, 5895, 5, 2]
(base) adel@adels-MacBook-Pro llama_cpp_dart %
Hi, thank you again for your help. I tested the code on macOS ARM64, Android, and Windows x86. On macOS it works as expected and matches your output. However, on Android it still throws a null pointer when using the GitHub version, and on Windows it shows:
llama_model_load_from_file_impl: no backends are loaded. hint: use ggml_backend_load() or ggml_backend_load_all() to load a backend before calling this function
which then results in the same LlamaException as on Android:
LlamaException: Failed to initialize Llama (LlamaException: Could not load model at bge-m3-q4_k_m.gguf).
I tested the pub.dev 0.1.2+1 version with the updated tokenize code, and that works fine, so I will use this for now. I hope this information helps you continue developing the library. If you would like more details, I am happy to help with further testing.
Yes, I understand the issue. On Android the llama.cpp build likely does not have GPU support, so you need to set modelParams.mainGpu = -1; to force CPU only. I guess Windows has the same issue.
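For reference, a minimal sketch of that suggestion applied to the reproduction snippet above (the model path is a placeholder, and the mainGpu = -1 switch is taken from the comment above rather than verified on Windows):

import 'package:llama_cpp_dart/llama_cpp_dart.dart';

void main() {
  // CPU-only setup per the suggestion above: mainGpu = -1 so no layers are
  // offloaded to a GPU backend that may be missing on Android / Windows.
  final modelParams = ModelParams()..mainGpu = -1;
  final contextParams = ContextParams()..embeddings = true;

  final llama = Llama(
    "path/to/bge-m3-q4_k_m.gguf", // placeholder model path
    modelParams: modelParams,
    contextParams: contextParams,
    samplerParams: SamplerParams(),
    verbose: true,
  );

  // Quick sanity check that the model loaded and tokenization works.
  final tokens = llama.tokenize("hello world", true);
  print("tokenized ok, ${tokens.length} tokens");
}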