# Llamacpp with cpp backend
## Description
Benchmarking LLM deployment with CPP Backend
## Setup and Test
- Follow the instructions from README.md to set up the environment.
- Download the TheBloke/Llama-2-7B-Chat-GGML model.
```bash
cd serve/cpp/test/resources/torchscript_model/llm/llm_handler
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin
```
Then update the model path in the handler script to point to the downloaded file.
To control the number of tokens to be generated, set the `max_context_size` variable in the script to the desired value.
Note: In the next version, this step will be changed to read the LLM path from config.
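As an optional sanity check (not part of the original steps), you can confirm the quantized weights downloaded completely before building:

```bash
# Optional: verify the GGML weights are present and non-trivial in size
# (the q4_0 quantized 7B file is several GB).
ls -lh llama-2-7b-chat.ggmlv3.q4_0.bin
```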
- Run the build
```bash
cd serve/cpp
./build.sh
```
Once the build succeeds, the `libllm_handler.so` shared object file is generated in the `serve/cpp/test/resources/torchscript_model/llm/llm_handler` folder.
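As a quick check (an optional sketch, assuming you run it from the repository root), confirm the shared object was produced:

```bash
# Verify the build emitted the handler library at the expected location.
test -f serve/cpp/test/resources/torchscript_model/llm/llm_handler/libllm_handler.so \
  && echo "libllm_handler.so built successfully"
```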
- Copy the `dummy.pt` file to the `llm_handler` folder.
- Move to the `llm_handler` folder and run the following command to generate the mar file
```bash
torch-model-archiver --model-name llm --version 1.0 --serialized-file dummy.pt --handler libllm_handler:LlmHandler --runtime LSP
```
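Since a `.mar` file is a zip archive, you can optionally list its contents to confirm the archive was packaged as expected:

```bash
# Optional: inspect the generated archive (requires unzip).
unzip -l llm.mar
```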
- Move the `llm.mar` to `model_store`
```bash
mkdir model_store
mv llm.mar model_store/llm.mar
```
- Create a new config.properties file and paste the following content.
```
default_response_timeout=300000
```
The default timeout is 120000 ms. With a context size of 512, LLM generation can take longer than that to complete a request on a single-GPU machine.
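For reference, a fuller config.properties could also pin the listener addresses. This is a minimal sketch: the address keys are standard TorchServe settings shown with illustrative values, and only the extended timeout is required by this setup.

```bash
# Sketch: write config.properties in one step. The address lines are
# optional and illustrative; the timeout is the setting this setup needs.
cat > config.properties <<'EOF'
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
default_response_timeout=300000
EOF
```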
- Start TorchServe
```bash
torchserve --start --ncs --ts-config config.properties --model-store model_store/
```
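Before registering the model, you can confirm the server is up using TorchServe's ping endpoint on the inference port:

```bash
# Health check; a running server responds with {"status": "Healthy"}.
curl http://localhost:8080/ping
```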
- Register the model using the following curl command
```bash
curl -v -X POST "http://localhost:8081/models?initial_workers=1&url=llm.mar"
```
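Registration can be verified with the management API's describe endpoint, which reports the model's workers and their status:

```bash
# Describe the registered model (name "llm" as used above).
curl http://localhost:8081/models/llm
```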
- Update the input in `prompt.txt` if needed and run
```bash
curl http://localhost:8080/predictions/llm -T prompt.txt
```
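The prompt file is plain text; for example (the prompt below is only illustrative):

```bash
# Write an example prompt, then re-run the inference request above.
echo "Hello my name is" > prompt.txt
```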
## Type of change
Please delete options that are not relevant.
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [x] New feature (non-breaking change which adds functionality)
- [ ] This change requires a documentation update
## Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced. Please also list any relevant details for your test configuration.
- [x] Test A
  Logs for Test A
- [ ] Test B
  Logs for Test B
## Checklist:
- [x] Did you have fun?
- [x] Have you added tests that prove your fix is effective or that this feature works?
- [ ] Has code been commented, particularly in hard-to-understand areas?
- [ ] Have you made corresponding changes to the documentation?
@mreso Thanks for your review comments. I have already addressed a few of them (implementing the destructor, batch processing, removing `auto`) based on your earlier comments on the babyllama PR. I will address the remaining ones and let you know.
This feature was picked up as part of the v0.10.0 task.