
Llamacpp with cpp backend

Open shrinath-suresh opened this issue 2 years ago • 1 comment

Description

Benchmarking LLM deployment with CPP Backend

Setup and Test

  1. Follow the instructions from README.md to set up the environment

  2. Download the TheBloke/Llama-2-7B-Chat-GGML model.

cd serve/cpp/test/resources/torchscript_model/llm/llm_handler
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin

and update the path of the model in the script here.

To control the number of tokens to be generated, update the max_context_size variable in the script to the desired value.

Note: In the next version, this step will be changed to read the llm path from config.

  3. Run the build
cd serve/cpp
./build.sh

Once the build succeeds, a libllm_handler.so shared object file will be generated in the serve/cpp/test/resources/torchscript_model/llm/llm_handler folder.

  4. Copy the dummy.pt file to the llm_handler folder.
  5. Move to the llm_handler folder and run the following command to generate the mar file
torch-model-archiver --model-name llm --version 1.0 --serialized-file dummy.pt --handler libllm_handler:LlmHandler --runtime LSP
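A .mar file produced by torch-model-archiver is a plain zip archive. As a quick sanity check after this step, the archive's contents and manifest can be inspected in Python (a hedged sketch; the `MAR-INF/MANIFEST.json` manifest path is assumed from torch-model-archiver's usual layout):

```python
import json
import zipfile

def inspect_mar(path):
    """List a .mar archive's contents and return its manifest, if present.

    .mar files are zip archives; the manifest is expected under
    MAR-INF/MANIFEST.json (path assumed, not taken from this PR).
    """
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        manifest = None
        if "MAR-INF/MANIFEST.json" in names:
            manifest = json.loads(zf.read("MAR-INF/MANIFEST.json"))
        return names, manifest
```

Calling `inspect_mar("llm.mar")` should show dummy.pt and the handler shared object among the entries.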
  6. Move the llm.mar to model_store
mkdir model_store
mv llm.mar model_store/llm.mar
  7. Create a new config.properties file and paste the following content.
default_response_timeout=300000

The default timeout is 120000. With a context size of 512, LLM generation can take longer than the default to complete a request on a single-GPU machine, hence the larger value.
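config.properties uses simple key=value lines. A minimal sketch of parsing such a file and reading the timeout back (the key name is taken from the snippet above; the comment-skipping behavior is an assumption for robustness, not something this PR specifies):

```python
def parse_properties(text):
    """Parse simple key=value property lines; blanks and # comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

config = parse_properties("default_response_timeout=300000")
print(config["default_response_timeout"])  # -> 300000
```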

  8. Start TorchServe
torchserve --start --ncs --ts-config config.properties --model-store model_store/
  9. Register the model using the curl command
curl -v -X POST "http://localhost:8081/models?initial_workers=1&url=llm.mar"
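The registration call is a POST to the TorchServe management API (port 8081) with `initial_workers` and `url` query parameters. A hedged Python counterpart of the curl command above, which only builds the request without sending it (host and defaults mirror the walkthrough):

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_register_request(host="http://localhost:8081", mar="llm.mar", workers=1):
    """Build (but do not send) the model-registration POST request."""
    query = urlencode({"initial_workers": workers, "url": mar})
    return Request(f"{host}/models?{query}", method="POST")

req = build_register_request()
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the same call as the curl command once TorchServe is running.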
  10. Update the input in prompt.txt if needed and run
curl http://localhost:8080/predictions/llm -T prompt.txt
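The prediction call sends the contents of prompt.txt as the request body to the inference API (port 8080). A minimal Python sketch of the same call (note: `curl -T` performs an upload via PUT, while this sketch uses POST, which the TorchServe inference endpoint also accepts; the sample prompt bytes are a placeholder, not the PR's prompt.txt):

```python
from urllib.request import Request

def build_predict_request(prompt: bytes, host="http://localhost:8080", model="llm"):
    """Build (but do not send) an inference request carrying the prompt as the body."""
    return Request(f"{host}/predictions/{model}", data=prompt, method="POST")

req = build_predict_request(b"Hello")
# urllib.request.urlopen(req) would send it once TorchServe is running
```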

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced. Please also list any relevant details for your test configuration.

  • [x] Test A Logs for Test A

  • [ ] Test B Logs for Test B

Checklist:

  • [x] Did you have fun?
  • [x] Have you added tests that prove your fix is effective or that this feature works?
  • [ ] Has code been commented, particularly in hard-to-understand areas?
  • [ ] Have you made corresponding changes to the documentation?

shrinath-suresh avatar Aug 16 '23 18:08 shrinath-suresh

@mreso Thanks for your review comments. I have already addressed a few of them (implementing the destructor, batch processing, removing auto) based on your previous comments in the babyllama PR. Will address the remaining ones and let you know.

shrinath-suresh avatar Sep 15 '23 09:09 shrinath-suresh

This feature was picked up in v0.10.0 task.

lxning avatar Mar 11 '24 22:03 lxning