ggllm.cpp
Metal support
ggml and llama.cpp support Metal. Do Apple Silicon users need to use llama.cpp, or can they use ggllm.cpp with Falcon models?
I tried the following:

```shell
# build with Metal enabled
LLAMA_METAL=1 make falcon_main falcon_quantize falcon_perplexity

# run the model
./falcon_main -t 4 -ngl 100 -b 1 -m ../Models/WizardLM-Uncensored-Falcon-7B-GGML/wizardlm-7b-uncensored.ggccv1.q4_0.bin -enc -p "write a story about llamas"
```
It outputs:
main: build = 883 (2b487f2)
falcon.cpp: loading model from ../Models/WizardLM-Uncensored-Falcon-7B-GGML/wizardlm-7b-uncensored.ggccv1.q4_0.bin
falcon.cpp: file version 10
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
| Info | format | n_vocab | n_bpe | n_ctx | n_embd | n_head ; kv | n_layer | falcon | ftype | n_ff |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
| | ggcc v1 | 65024 | 64784 | 2048 | 4544 | 71 ; 1 | 32 | 7; 7B | 2 | 18176 |
+---------------+------------+---------+---------+-------+--------+---------------+---------+--------+-------+--------+
falcon_model_load_internal: ggml ctx size = 0.00 MB (mmap size = 3872.00 MB)
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon.cpp: Special mode: Wizard-type finetuning - changing tensor shape
falcon_model_load_internal: mem required = 4196.81 MB (+ 48.00 MB per state)
[==================================================] 100% Tensors populated
falcon_context_prepare: Context falcon_main RAM buffers - key_val = 16.00 MB, Compute = 160.00 MB, Scratch 0 = 124.00 MB, Scratch 1 = 40.14 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Volumes/SanDisk/ggllm.cpp/ggml-metal.metal'
ggml_metal_init: loaded kernel_add 0x14160b850
ggml_metal_init: loaded kernel_mul 0x14160bf70
ggml_metal_init: loaded kernel_mul_row 0x14160c5a0
ggml_metal_init: loaded kernel_scale 0x14160cac0
ggml_metal_init: loaded kernel_silu 0x14160cfe0
ggml_metal_init: loaded kernel_relu 0x14160d500
ggml_metal_init: loaded kernel_gelu 0x14160da20
ggml_metal_init: loaded kernel_soft_max 0x14160e0d0
ggml_metal_init: loaded kernel_diag_mask_inf 0x14160e730
ggml_metal_init: loaded kernel_get_rows_f16 0x14160edb0
ggml_metal_init: loaded kernel_get_rows_q4_0 0x14160f430
ggml_metal_init: loaded kernel_get_rows_q4_1 0x14160fc20
ggml_metal_init: loaded kernel_get_rows_q2_k 0x1416102a0
ggml_metal_init: loaded kernel_get_rows_q3_k 0x141610920
ggml_metal_init: loaded kernel_get_rows_q4_k 0x141610fa0
ggml_metal_init: loaded kernel_get_rows_q5_k 0x141611620
ggml_metal_init: loaded kernel_get_rows_q6_k 0x141611ca0
ggml_metal_init: loaded kernel_rms_norm 0x141612350
ggml_metal_init: loaded kernel_norm 0x141612a00
ggml_metal_init: loaded kernel_mul_mat_f16_f32 0x1416133d0
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32 0x141613ab0
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32 0x141614190
ggml_metal_init: loaded kernel_mul_mat_q2_k_f32 0x141614870
ggml_metal_init: loaded kernel_mul_mat_q3_k_f32 0x1416150f0
ggml_metal_init: loaded kernel_mul_mat_q4_k_f32 0x1416157d0
ggml_metal_init: loaded kernel_mul_mat_q5_k_f32 0x141615eb0
ggml_metal_init: loaded kernel_mul_mat_q6_k_f32 0x141616590
ggml_metal_init: loaded kernel_rope 0x141617080
ggml_metal_init: loaded kernel_alibi_f32 0x141617940
ggml_metal_init: loaded kernel_cpy_f32_f16 0x1416181d0
ggml_metal_init: loaded kernel_cpy_f32_f32 0x141618a60
ggml_metal_init: loaded kernel_cpy_f16_f16 0x1416192f0
ggml_metal_add_buffer: allocated 'data ' buffer, size = 3874.44 MB
ggml_metal_add_buffer: allocated 'eval ' buffer, size = 160.00 MB
ggml_metal_add_buffer: allocated 'kv ' buffer, size = 48.02 MB
ggml_metal_add_buffer: allocated 'scr0 ' buffer, size = 124.00 MB
ggml_metal_add_buffer: allocated 'scr1 ' buffer, size = 40.14 MB
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| Syst. Info | AVX | AVX2 | AVX512 | AVX512_VBMI | AVX512_VNNI | FMA | NEON | ARM_FMA | F16C | FP16_VA | SIMD | BLAS | SSE3 | VSX |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
| 4/10 thrd | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
+------------+-----+------+--------+-------------+-------------+-----+------+---------+------+---------+------+------+------+-----+
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
| Sampling | rpt_n | rpt_p | prs_p | frq_p | top_k | tfs_z | top_p | typ_p | temp | miro | mir_lr | mir_ent |
+------------+-------+-------+-------+-------+-------+-------+-------+-------+------+------+--------+---------+
| | 64 | 1.100 | 0.000 | 0.000 | 40 | 1.000 | 0.950 | 1.000 | 0.80 | 0 | 0.1000 | 5.00000 |
+============+=======+=======+=======+=======+=======+=======+-------+-------+------+------+--------+---------+
| Generation | Ctx | Batch | Keep | Prom. | Seed | Finetune | Stop |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
| | 2048 | 1 | 0 | 10 | 1692449979 | WIZARD | # 1 |
+------------+-------+-------+-------+-------+---------------+----------------------+------+
GGML_ASSERT: ggml-metal.m:530: ne02 == ne12
GGML_ASSERT: ggml-metal.m:530: ne02 == ne12
zsh: abort ./falcon_main -t 4 -ngl 100 -b 1 -m -enc -p "write a story about llamas"
Same issue here... I'll try to convert the model to other types.