Illegal Instruction when running a llamafile
Hi,
Issue:
I tried to run llava-v1.5-7b-q4.llamafile or TinyLlama-1.1B-Chat-v1.0.F16.llamafile on my system: Linux Ubuntu 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr 4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
But I encountered the same error at the same step for both:
stdout:
$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile
note: if you have an AMD or NVIDIA GPU then you need to pass -ngl 9999 to enable GPU offloading
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2856,"msg":"build info","tid":"11165056","timestamp":1715465433}
{"function":"server_cli","level":"INFO","line":2859,"msg":"system info","n_threads":4,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"11165056","timestamp":1715465433,"total_threads":4}
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from TinyLlama-1.1B-Chat-v1.0.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: llama.block_count u32 = 22
llama_model_loader: - kv 2: llama.context_length u32 = 2048
llama_model_loader: - kv 3: llama.embedding_length u32 = 2048
llama_model_loader: - kv 4: llama.feed_forward_length u32 = 5632
llama_model_loader: - kv 5: llama.attention.head_count u32 = 32
llama_model_loader: - kv 6: llama.attention.head_count_kv u32 = 4
llama_model_loader: - kv 7: llama.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 8: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 9: general.file_type u32 = 1
llama_model_loader: - kv 10: llama.vocab_size u32 = 32000
llama_model_loader: - kv 11: llama.rope.dimension_count u32 = 64
llama_model_loader: - kv 12: tokenizer.ggml.model str = llama
llama_model_loader: - kv 13: tokenizer.ggml.pre str = default
llama_model_loader: - kv 14: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "<0x00>", "<...
llama_model_loader: - kv 15: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 17: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 18: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 2
llama_model_loader: - kv 21: tokenizer.chat_template str = {% for message in messages %}\n{% if m...
llama_model_loader: - kv 22: general.quantization_version u32 = 2
llama_model_loader: - type f32: 45 tensors
llama_model_loader: - type f16: 156 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 4
llm_load_print_meta: n_layer = 22
llm_load_print_meta: n_rot = 64
llm_load_print_meta: n_embd_head_k = 64
llm_load_print_meta: n_embd_head_v = 64
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 256
llm_load_print_meta: n_embd_v_gqa = 256
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5632
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 1.10 B
llm_load_print_meta: model size = 2.05 GiB (16.00 BPW)
llm_load_print_meta: general.name = n/a
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llama.log content:
$ cat llama.log
warming up the model with an empty run
lscpu
It seems to be CPU related, so here is my lscpu:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 36 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz
CPU family: 6
Model: 42
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Stepping: 7
CPU max MHz: 3700.0000
CPU min MHz: 1600.0000
BogoMIPS: 6619.18
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow flexpriority ept vpid xsaveopt dtherm ida arat pln pts vnmi md_clear flush_l1d
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 1 MiB (4 instances)
L3: 6 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-3
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: KVM: Mitigation: VMX disabled
L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Mds: Mitigation; Clear CPU buffers; SMT disabled
Meltdown: Mitigation; PTI
Mmio stale data: Unknown: No mitigations
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
I saw a similar issue with a similar CPU: Support broken on old Intel/Amd CPUs #25. But as it does not crash at the same step, I was wondering if it could be related.
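For anyone else triaging this: the relevant difference is that Sandy Bridge reports AVX but not AVX2, F16C, or FMA, which matches the system_info line in the log above. A minimal way to check that at runtime (just a sketch; __builtin_cpu_supports is a GCC/Clang builtin, not part of llamafile):

```cpp
#include <cstdio>

// Minimal runtime ISA check (sketch only). __builtin_cpu_supports reads
// CPUID; on an i5-2500K this should print yes/no/no/no, matching the
// "AVX = 1 | AVX2 = 0 | F16C = 0 | FMA = 0" system_info output.
int main() {
    std::printf("avx:  %s\n", __builtin_cpu_supports("avx")  ? "yes" : "no");
    std::printf("avx2: %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
    std::printf("f16c: %s\n", __builtin_cpu_supports("f16c") ? "yes" : "no");
    std::printf("fma:  %s\n", __builtin_cpu_supports("fma")  ? "yes" : "no");
    return 0;
}
```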
Last stdout lines with --ftrace flag:
$ ./TinyLlama-1.1B-Chat-v1.0.F16.llamafile --ftrace
FUN 7143 7143 127'676'693'461 -123'127'225'490'312 &ggml_get_n_tasks.part.0
FUN 7143 7222 127'676'694'743 688 &ggml_get_n_tasks.part.0
FUN 7143 7223 127'676'695'076 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7224 127'676'695'958 688 &ggml_compute_forward
FUN 7143 7143 127'676'697'768 -123'127'225'490'312 &ggml_compute_forward
FUN 7143 7222 127'676'698'572 688 &ggml_compute_forward
FUN 7143 7224 127'676'700'804 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7223 127'676'700'698 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'702'180 -123'127'225'489'912 &ggml_compute_forward_mul_mat
FUN 7143 7222 127'676'703'139 1'088 &ggml_compute_forward_mul_mat
FUN 7143 7224 127'676'704'676 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7223 127'676'705'968 1'632 &ggml_syncthreads
FUN 7143 7143 127'676'707'146 -123'127'225'489'288 &llamafile_sgemm_amd_avx
FUN 7143 7222 127'676'708'192 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7224 127'676'709'142 1'632 &ggml_syncthreads
FUN 7143 7143 127'676'711'551 -123'127'225'489'368 &ggml_fp32_to_fp16_row_amd_avx
FUN 7143 7222 127'676'712'329 1'632 &ggml_fp32_to_fp16_row_amd_avx
FUN 7143 7223 127'676'718'666 1'696 &sched_yield
FUN 7143 7224 127'676'722'670 1'696 &sched_yield
FUN 7143 7143 127'676'722'489 -123'127'225'489'368 &ggml_syncthreads
FUN 7143 7222 127'676'723'178 1'632 &ggml_syncthreads
FUN 7143 7222 127'676'727'610 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'728'117 -123'127'225'489'288 &llamafile_sgemm_amd_avx
FUN 7143 7222 127'676'731'582 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7223 127'676'733'826 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'736'033 -123'127'225'489'112 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7222 127'676'736'916 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7223 127'676'737'875 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7224 127'676'739'365 1'712 &llamafile_sgemm_amd_avx
FUN 7143 7143 127'676'739'291 -123'127'225'489'032 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7223 127'676'741'568 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
FUN 7143 7224 127'676'742'748 1'888 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE6mnpackEllll
FUN 7143 7224 127'676'746'386 1'968 &_ZN12_GLOBAL__N_18tinyBLASILi0ELi8EDv8_fS1_ttfE4gemmILi2ELi2ELi1EEEvllll
Illegal instruction (core dumped)
OK, you have a Sandy Bridge CPU. Five years past EOL, but still supported by us. Could you run ./llava-v1.5-7b-q4.llamafile --version
and tell me what it says? It'd help to know which version of llamafile your llamafiles are.
Hi, sure, it's an old rig 😉 Sufficient for daily tasks, but outdated for modern AI experimentation...
Here is the information:
$ ./llava-v1.5-7b-q4.llamafile --version
llamafile v0.8.4
Note: I had to download APE / APE-jart and register them.
Same thing here
It seems to be a regression between version 0.7.0 and version 0.8.0. Reproduced with a Xeon E5-2407 (Sandy Bridge). [Everything is fine with a Xeon® Silver 4108 (Skylake).]
| model | version | status |
|---|---|---|
| mistral-7b-instruct-v0.2.Q5_K_M.llamafile | llamafile v0.7.0 | OK |
| mistral-7b-instruct-v0.2.Q4_0.llamafile | llamafile v0.8.0 | Illegal instruction (core dumped) |
I see what the issue is here. I've confirmed a fix is incoming.
Please be warned that once this fix goes live, using F16 weights on a Sandy Bridge CPU, which lacks the F16C ISA, will no longer crash, but it will almost certainly be very slow. On an older CPU you'll most likely be better served by the Q4 weights.
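To illustrate where the time goes (a rough sketch under my own assumptions, not llamafile's actual code): with F16C, fp32↔fp16 conversions like the ggml_fp32_to_fp16_row call visible in the ftrace above compile down to single vcvtps2ph/vcvtph2ps instructions; without it, every element has to go through scalar bit manipulation along these lines:

```cpp
#include <cstdint>
#include <cstring>
#if defined(__F16C__)
#include <immintrin.h>
#endif

// Simplified software float -> half conversion (round-to-nearest, subnormal
// results flushed to zero). Only a sketch of the kind of scalar fallback
// needed when the CPU lacks F16C; the real conversion handles more edge cases.
static uint16_t fp32_to_fp16_scalar(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    int32_t  e    = (int32_t)((x >> 23) & 0xffu) - 127 + 15;   // rebias exponent
    uint32_t m    = x & 0x7fffffu;
    if (e >= 31) return sign | 0x7c00u | (m ? 0x200u : 0);      // overflow/Inf/NaN
    if (e <= 0)  return sign;                                   // flush tiny values to zero
    uint16_t h = sign | (uint16_t)(e << 10) | (uint16_t)(m >> 13);
    if (m & 0x1000u) h++;                                       // crude round-to-nearest
    return h;
}

static uint16_t fp32_to_fp16(float f) {
#if defined(__F16C__)
    // One hardware instruction per value (or 8 at a time via _mm256_cvtps_ph).
    return _cvtss_sh(f, _MM_FROUND_TO_NEAREST_INT);
#else
    return fp32_to_fp16_scalar(f);  // what a Sandy Bridge CPU is stuck with
#endif
}
```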