Chinese-LLaMA-Alpaca
The model hangs partway through startup. What could the problem be?
main: seed = 1681116321
llama_model_load: loading model from 'zh-models/7B/ggml-model-f16.bin' - please wait ...
llama_model_load: n_vocab = 49954
llama_model_load: n_ctx = 2048
llama_model_load: n_embd = 4096
llama_model_load: n_mult = 256
llama_model_load: n_head = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot = 128
llama_model_load: f16 = 1
llama_model_load: n_ff = 11008
llama_model_load: n_parts = 1
llama_model_load: type = 1
llama_model_load: ggml map size = 13134.21 MB
llama_model_load: ggml ctx size = 81.25 KB
llama_model_load: mem required = 14926.29 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from 'zh-models/7B/ggml-model-f16.bin'
llama_model_load: model size = 13133.55 MB / num tensors = 291
llama_init_from_file: kv self size = 1024.00 MB
system_info: n_threads = 80 / 80 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: temp = 0.200000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 2048, n_batch = 8, n_predict = 256, n_keep = 21
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
Below is an instruction that describes a
I used the command ./main -m zh-models/7B/ggml-model-q4_0.bin --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.3. The output hangs right at that "a", and I cannot interact with it either.
Hmm, this does happen sometimes; pressing Enter will sometimes get it going again. Also, judging from your log you are on the previous version of llama.cpp. You could pull the latest version and recompile; personally I find startup smoother than with the previous release.
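If it helps, updating and rebuilding usually comes down to something like the following inside your llama.cpp checkout (a sketch of the standard llama.cpp Makefile workflow, not a command taken from this thread):

cd llama.cpp          # your local clone of ggerganov/llama.cpp
git pull              # fetch the latest sources
make clean && make    # rebuild main (and quantize) with the default Makefile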
After updating llama.cpp it does run now, but generation is extremely slow, roughly 5-10 minutes per character. Is that normal? For example, here is what it looks like after running for 20 minutes:

main: seed = 1681118045
llama.cpp: loading model from zh-models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 49954
llama_model_load_internal: n_ctx = 256
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: f16 = 2
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 59.11 KB
llama_model_load_internal: mem required = 5896.99 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size = 128.00 MB
system_info: n_threads = 80 / 80 | AVX = 1 | AVX2 = 1 | AVX512 = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
Reverse prompt: '### Instruction:
'
sampling: temp = 0.200000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
generate: n_ctx = 256, n_batch = 8, n_predict = 256, n_keep = 21
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to LLaMa.
- If you want to submit another line, end your input in '\'.
Below is an instruction that describes a task. Write a response that appropriately completes the request.
你是一个导游,请你介绍一下中国的名胜古迹 中国
Take a look at #51.
I gave that a try, but it is still very slow. Could the program not be using my GPU, and is that why it is slow?
llama.cpp does not use the GPU. It runs noticeably faster on Mac M-series chips. You can also try raising -b (batch size) and -t (thread count) to speed things up (in my tests the improvement was not very noticeable).
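For reference, those flags are just appended to the usual command, roughly like this (the -t and -b values below are illustrative, not tuned recommendations):

./main -m zh-models/7B/ggml-model-q4_0.bin --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 --repeat_penalty 1.3 -t 8 -b 16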
My system is Linux debian 5.10.0-20-amd64 #1 SMP Debian 5.10.158-2 (2022-12-13) x86_64.
After quantizing the model with llama.cpp to obtain zh-models/7B/ggml-model-q4_0.bin,
running inference with main is extremely slow.
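For context, the q4_0 file is normally produced with llama.cpp's quantize tool, roughly as follows (whether the last argument is a numeric type code or the string q4_0 depends on the llama.cpp version; this line is an assumption, not a command from this thread):

./quantize zh-models/7B/ggml-model-f16.bin zh-models/7B/ggml-model-q4_0.bin 2   # 2 = q4_0 in older llama.cpp builds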
./main -m zh-models/7B/ggml-model-f16.bin --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 -b 2048 --repeat_penalty 1.3
I specified batch_size=2048 with -b, but the log still shows 512. Also, main's default thread count is supposed to be 4, yet according to the log it is not 4; instead it saturates every core (56/56).
After I explicitly set the thread count to 4, the program no longer stalled.
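In other words, the earlier command with an explicit thread count appended, along these lines (only -t 4 is new; the rest mirrors the command above):

./main -m zh-models/7B/ggml-model-f16.bin --color -f prompts/alpaca.txt -ins -c 2048 --temp 0.2 -n 256 -b 2048 --repeat_penalty 1.3 -t 4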
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your consideration.
Closing the issue, since no updates observed. Feel free to re-open if you need any further assistance.