openplayground
Fix #8, fix #41: Add llama.cpp support
add basic llama.cpp support via abetlen/llama-cpp-python
+1 for using abetlen/llama-cpp-python
I think the devs are doing a very good job of keeping up with the latest llama.cpp code base and providing Python bindings.
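For anyone curious, the streaming interface the bindings expose looks roughly like this (model path, prompt, and parameter values below are just illustrative, not the actual code in this PR):

from llama_cpp import Llama

# Load a ggml model; the path here is a placeholder.
llm = Llama(model_path="/path/to/ggml-model-q4_1.bin", n_ctx=512)

# Stream the completion one chunk at a time, similar to how the
# playground streams tokens back over SSE.
for chunk in llm(
    "### Instruction:\nWhat is the capital of France?\n### Response:\n",
    max_tokens=58,
    temperature=0.95,
    top_p=1.0,
    stop=["### "],
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)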
I'm unable to get inference going with this branch - any tips? I've managed to get it to load the alpaca models I have, and it clearly starts up llama.cpp in the background, but the only tokens I get back are single "\n" characters. Here's a bit of the log that shows inference occurring, in case that helps!
INFO:server.lib.api.inference:Path: /api/inference/text/stream, Request: {'prompt': 'What is the capital of france?\n\n', 'models': [{'name': 'llama-local:alpaca-13b', 'tag': 'llama-local:alpaca-13b', 'capabilities': [], 'provider': 'llama-local', 'parameters': {'temperature': 0.95, 'maximumLength': 58, 'topP': 1, 'repetitionPenalty': 1, 'stopSequences': ['Question:', 'User:', 'Bob:', 'Joke:', '### ']}, 'enabled': True, 'selected': True}]}
INFO:server.lib.sseserver:LISTENING TO: inferences
INFO:server.lib.sseserver:LISTENING
INFO:server.app:Received inference request llama-local
INFO:server.lib.inference:Requesting inference from alpaca-13b on llama-local
INFO:werkzeug:192.168.1.243 - - [16/Apr/2023 20:30:24] "POST /api/inference/text/stream HTTP/1.1" 200 -
llama.cpp: loading model from /local-stuff/models/alpaca/alpaca-lora-13B-ggml/ggml-model-q4_1.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 73.73 KB
llama_model_load_internal: mem required = 11359.03 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size = 400.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
INFO:server.lib.inference:Completed inference for alpaca-13b on llama-local
INFO:server.lib.api.inference:Done streaming SSE
I can't find anything unusual in the log. Maybe you should add some logging in server/lib/inference/__init__.py around line 623, and check whether there is any problem with llama-cpp-python?
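To rule out the bindings themselves, you could also try a single completion outside openplayground, something like this (reusing the model path from your log; the parameters are just examples):

from llama_cpp import Llama

llm = Llama(model_path="/local-stuff/models/alpaca/alpaca-lora-13B-ggml/ggml-model-q4_1.bin")
out = llm("What is the capital of France?\n\n", max_tokens=58, temperature=0.95)
# If this also prints only newlines, the problem is the prompt or the
# bindings rather than the openplayground streaming code.
print(repr(out["choices"][0]["text"]))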
Did you set the prompt template file for alpaca? Here is mine as an example:
### Instruction:
{prompt}
### Response:
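The {prompt} placeholder is filled in with ordinary Python string formatting, along these lines (a simplified sketch, not openplayground's exact template handling):

template = "### Instruction:\n{prompt}\n### Response:\n"
full_prompt = template.format(prompt="What is the capital of France?")
# full_prompt is what actually gets sent to llama.cpp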
Aha, that prompt did the trick! Thanks! Do you have example prompts for other models? I didn't realize they used {} formatting, so I was just using the prompts that ship with llama.cpp.
Sure, I added some more examples to the README file.