iOS Exception: Could not load model at ...
Hi there,
First off, thanks for the hard work creating this package.
I am having some issues getting the package to run on iOS. I am loading both dynamic libraries, libggml.dylib and libllama.dylib, like so:
Llama.libraryPath = "libllama.dylib";
The issue comes when trying to load a model like so:
final model = 'ggml-vocab-gpt-2.gguf';
final directory = await getApplicationDocumentsDirectory();
final filePath = '${directory.path}/$model';
final fileExists = await File(filePath).exists();

if (!fileExists) {
  // Copy the model out of the asset bundle so it has a real file path on disk.
  final byteData = await rootBundle.load('assets/ai/$model');
  final file = File(filePath);
  await file.writeAsBytes(
      byteData.buffer.asUint8List(byteData.offsetInBytes, byteData.lengthInBytes));
}

Llama llama = Llama(
  filePath,
  modelParams,
  contextParams,
  samplerParams,
);
Even passing a raw path:
Llama llama = Llama(
  '/Users/Luke/Workspace/llama.cpp/models/tinyllama-2-1b-miniguanaco.Q3_K_L.gguf',
  modelParams,
  contextParams,
  samplerParams,
);
also does not work.
We are using the latest dev branch (commit hash 231a3e84f9e8b3e92470654af8b48f3811b4ec06).
Any help or guidance here would be greatly appreciated.
Error:
Could not load model XYZ
flutter: Error: LateInitializationError: Field 'context' has not been initialized.
Upon further debugging, here is the lovely Xcode output:
llama_load_model_from_file: using device Metal (Apple A15 GPU) - 2727 MiB free
llama_model_loader: loaded meta data with 16 key-value pairs and 0 tensors from /var/mobile/Containers/Data/Application/C44C7A73-3C3E-4778-953F-B3F8412A71BF/Documents/ggml-vocab-gpt-2.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = gpt2
llama_model_loader: - kv 1: general.name str = gpt-2
llama_model_loader: - kv 2: gpt2.block_count u32 = 12
llama_model_loader: - kv 3: gpt2.context_length u32 = 1024
llama_model_loader: - kv 4: gpt2.embedding_length u32 = 768
llama_model_loader: - kv 5: gpt2.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: gpt2.attention.head_count u32 = 12
llama_model_loader: - kv 7: gpt2.attention.layer_norm_epsilon f32 = 0.000010
llama_model_loader: - kv 8: general.file_type u32 = 1
llama_model_loader: - kv 9: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 10: tokenizer.ggml.pre str = gpt-2
llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,50257] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,50257] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,50000] = ["Ġ t", "Ġ a", "h e", "i n", "r e",...
llama_model_loader: - kv 14: tokenizer.ggml.bos_token_id u32 = 50256
llama_model_loader: - kv 15: tokenizer.ggml.eos_token_id u32 = 50256
llm_load_vocab: special tokens cache size = 1
llm_load_vocab: token to piece cache size = 0.3060 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = gpt2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 50257
llm_load_print_meta: n_merges = 50000
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 1024
llm_load_print_meta: n_embd = 768
llm_load_print_meta: n_layer = 12
llm_load_print_meta: n_head = 12
llm_load_print_meta: n_head_kv = 12
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 768
llm_load_print_meta: n_embd_v_gqa = 768
llm_load_print_meta: f_norm_eps = 1.0e-05
llm_load_print_meta: f_norm_rms_eps = 0.0e+00
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 3072
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = -1
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 1024
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 0.1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 0.00 K
llm_load_print_meta: model size = 0.00 MiB (nan BPW)
llm_load_print_meta: general.name = gpt-2
llm_load_print_meta: BOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOS token = 50256 '<|endoftext|>'
llm_load_print_meta: EOT token = 50256 '<|endoftext|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 50256 '<|endoftext|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size = 0.03 MiB
llama_model_load: error loading model: check_tensor_dims: tensor 'token_embd.weight' not found
llama_load_model_from_file: failed to load model
After even more investigation, it seems the file I was using was not a usable model (ggml-vocab-gpt-2.gguf only contains the vocabulary, hence the 0 tensors in the log above). Using a different model, I now get the following error:
/llama.cpp/src/llama-sampling.cpp:279: GGML_ASSERT(cur_p.selected >= 0 && cur_p.selected < (int32_t) cur_p.size) failed
Any idea on this?
Cheers!
Further update: I got past the above GGML_ASSERTs by modifying the model params and context params, roughly along the lines of the sketch below.
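Something like this (illustrative only, not my exact values; the field names nCtx, nBatch and nGpuLayers are assumptions about the ContextParams/ModelParams API, so check the package source):

// Illustrative sketch, not the exact changes I made.
// Field names are assumptions about the package API.
final modelParams = ModelParams();
modelParams.nGpuLayers = 0; // e.g. fall back to CPU-only while debugging Metal issues

final contextParams = ContextParams();
contextParams.nCtx = 2048; // at most the context length the model was trained with
contextParams.nBatch = 512; // keep the batch size <= nCtx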
Hit another wall when it comes to encoding.
llama.cpp/src/llama.cpp:15342: GGML_ASSERT(n_outputs_enc > 0 && "call llama_encode() first") failed
Any ideas here?
@LukeMoody01 I could run tinyllama-2-1b-miniguanaco.Q3_K_L.gguf with the scripts in the example folder, both simple.dart and chat.dart. Could it be that your prompt is larger than the context? I will try to investigate. Also, thank you for trying the dev branch.
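As a quick sanity check, a rough estimate like this (just a rule of thumb of about 4 characters per token for English text, not the real tokenizer) can tell you whether the prompt is getting close to the context size you pass in ContextParams:

// Rough rule of thumb: ~4 characters per token for English text.
// If the estimate approaches nCtx, shorten the prompt or raise the context size.
int estimateTokens(String prompt) => (prompt.length / 4).ceil();

void main() {
  const prompt = 'Write a short story about a robot learning to paint.';
  print('Estimated tokens: ${estimateTokens(prompt)}');
}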
@LukeMoody01 please try again
Hey @netdur,
I will give it a go using the model you just mentioned! I was using a Flan-T5 model (an encoder-decoder architecture, which is presumably why the llama_encode assert fired).
I am on iOS as well.
I'll get back to you soon.
Alright, so it "works".
The AI likes to cut off its responses very early, but I feel like that could be a config issue on my end. How do you usually allow the AI to give lengthier responses? @netdur
Thanks. In the Llama class I have the predict field fixed at a low value, and it sets the length of the output. I will expose it.
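Once it is exposed, usage might look something like this (the name nPredict and its location on ContextParams are assumptions; the final API may differ):

// Hypothetical: once the predict value is exposed, raising it should allow longer
// responses. In llama.cpp, -1 conventionally means "generate until end-of-sequence".
final contextParams = ContextParams();
contextParams.nPredict = 512; // allow up to 512 new tokens instead of the current low default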
That'd be awesome. Great work @netdur 😄
Can I also ask where you find the models you test with? Some of the models I find on Hugging Face throw errors such as the ones above, or "GGML_ASSERT(strcmp(res->name, "result_output") == 0 && "missing result_output tensor")".
I tested these models:
https://huggingface.co/TheBloke/Tinyllama-2-1b-miniguanaco-GGUF/blob/main/tinyllama-2-1b-miniguanaco.Q3_K_L.gguf
https://huggingface.co/mradermacher/Qwen2-7B-Multilingual-RP-GGUF/blob/main/Qwen2-7B-Multilingual-RP.Q8_0.gguf
https://huggingface.co/MaziyarPanahi/gemma-7b-GGUF/blob/main/gemma-7b.Q8_0.gguf
@LukeMoody01 I am currently testing on iOS. How do you build llama.cpp for iOS?