openplayground
Fix #8, fix #41: Add llama.cpp support
add basic llama.cpp support via abetlen/llama-cpp-python
+1 for using abetlen/llama-cpp-python
I think the devs are doing a very good job of keeping up with the latest llama.cpp code base and providing Python bindings.
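For anyone curious, the streaming interface the bindings expose looks roughly like this (model path, prompt, and parameter values below are just illustrative, not the actual code in this PR):

from llama_cpp import Llama

# Load a ggml model; the path here is a placeholder.
llm = Llama(model_path="/path/to/ggml-model-q4_1.bin", n_ctx=512)

# Stream the completion one chunk at a time, similar to how the
# playground streams tokens back over SSE.
for chunk in llm(
    "### Instruction:\nWhat is the capital of France?\n### Response:\n",
    max_tokens=58,
    temperature=0.95,
    top_p=1.0,
    stop=["### "],
    stream=True,
):
    print(chunk["choices"][0]["text"], end="", flush=True)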
I'm unable to get inference going with this branch - any tips? I've managed to get it to load the alpaca models I have, and it clearly starts up llama.cpp in the background, but the only tokens I get back are single "\n" characters. Here's a bit of the log that shows inference occurring, in case that helps!
INFO:server.lib.api.inference:Path: /api/inference/text/stream, Request: {'prompt': 'What is the capital of france?\n\n', 'models': [{'name': 'llama-local:alpaca-13b', 'tag': 'llama-local:alpaca-13b', 'capabilities': [], 'provider': 'llama-local', 'parameters': {'temperature': 0.95, 'maximumLength': 58, 'topP': 1, 'repetitionPenalty': 1, 'stopSequences': ['Question:', 'User:', 'Bob:', 'Joke:', '### ']}, 'enabled': True, 'selected': True}]}
INFO:server.lib.sseserver:LISTENING TO: inferences
INFO:server.lib.sseserver:LISTENING
INFO:server.app:Received inference request llama-local
INFO:server.lib.inference:Requesting inference from alpaca-13b on llama-local
INFO:werkzeug:192.168.1.243 - - [16/Apr/2023 20:30:24] "POST /api/inference/text/stream HTTP/1.1" 200 -
llama.cpp: loading model from /local-stuff/models/alpaca/alpaca-lora-13B-ggml/ggml-model-q4_1.bin
llama_model_load_internal: format = ggjt v1 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 3 (mostly Q4_1)
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 73.73 KB
llama_model_load_internal: mem required = 11359.03 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size = 400.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
INFO:server.lib.inference:Completed inference for alpaca-13b on llama-local
INFO:server.lib.api.inference:Done streaming SSE
I can't find anything unusual in the log. Maybe you should add some logging in server/lib/inference/__init__.py around line 623, and check whether there is any problem with llama-cpp-python?
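To rule out the bindings themselves, you could also try a single completion outside openplayground, something like this (reusing the model path from your log; the parameters are just examples):

from llama_cpp import Llama

llm = Llama(model_path="/local-stuff/models/alpaca/alpaca-lora-13B-ggml/ggml-model-q4_1.bin")
out = llm("What is the capital of France?\n\n", max_tokens=58, temperature=0.95)
# If this also prints only newlines, the problem is the prompt or the
# bindings rather than the openplayground streaming code.
print(repr(out["choices"][0]["text"]))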
Did you set the prompt template file for alpaca? Here is mine as an example:
### Instruction:
{prompt}
### Response:
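The {prompt} placeholder is filled in with ordinary Python string formatting, along these lines (a simplified sketch, not openplayground's exact template handling):

template = "### Instruction:\n{prompt}\n### Response:\n"
full_prompt = template.format(prompt="What is the capital of France?")
# full_prompt is what actually gets sent to llama.cpp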
Aha, that prompt did the trick! Thanks! Do you have example prompts for other models? I didn't realize they used {} formatting, so I was just using the prompts that ship with llama.cpp.
Sure, I added some more examples to the README file.