
Llamacpp with cpp backend

Open shrinath-suresh opened this issue 2 years ago • 1 comment

Description

Benchmarking LLM deployment with CPP Backend

Setup and Test

  1. Follow the instructions from README.md to set up the environment

  2. Download the TheBloke/Llama-2-7B-Chat-GGML model.

cd serve/cpp/test/resources/torchscript_model/llm/llm_handler
wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin

and update the path of the model in the script here.

To control the number of tokens to be generated, update the max_context_size variable in the script to the desired value.

Note: In the next version, this step will be changed to read the llm path from config.

  3. Run the build
cd serve/cpp
./build.sh

Once the build succeeds, a libllm_handler.so shared object file will be generated in the serve/cpp/test/resources/torchscript_model/llm/llm_handler folder.

  4. Copy the dummy.pt file to the llm_handler folder.
  5. Move to the llm_handler folder and run the following command to generate the mar file
torch-model-archiver --model-name llm --version 1.0 --serialized-file dummy.pt --handler libllm_handler:LlmHandler --runtime LSP
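A .mar file produced by torch-model-archiver is a plain zip archive. As a quick sanity check after this step, the archive's contents and manifest can be inspected in Python (a hedged sketch; the `MAR-INF/MANIFEST.json` manifest path is assumed from torch-model-archiver's usual layout):

```python
import json
import zipfile

def inspect_mar(path):
    """List a .mar archive's contents and return its manifest, if present.

    .mar files are zip archives; the manifest is expected under
    MAR-INF/MANIFEST.json (path assumed, not taken from this PR).
    """
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
        manifest = None
        if "MAR-INF/MANIFEST.json" in names:
            manifest = json.loads(zf.read("MAR-INF/MANIFEST.json"))
        return names, manifest
```

Calling `inspect_mar("llm.mar")` should show dummy.pt and the handler shared object among the entries.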
  6. Move the llm.mar to model_store
mkdir model_store
mv llm.mar model_store/llm.mar
  7. Create a new config.properties file and paste the following content.
default_response_timeout=300000

The default timeout is 120000. With a context size of 512, LLM generation can take longer than the default to complete a request on a single-GPU machine, hence the larger value.
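config.properties uses simple key=value lines. A minimal sketch of parsing such a file and reading the timeout back (the key name is taken from the snippet above; the comment-skipping behavior is an assumption for robustness, not something this PR specifies):

```python
def parse_properties(text):
    """Parse simple key=value property lines; blanks and # comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

config = parse_properties("default_response_timeout=300000")
print(config["default_response_timeout"])  # -> 300000
```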

  8. Start TorchServe
torchserve --start --ncs --ts-config config.properties --model-store model_store/
  9. Register the model using the curl command
curl -v -X POST "http://localhost:8081/models?initial_workers=1&url=llm.mar"
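The registration call is a POST to the TorchServe management API (port 8081) with `initial_workers` and `url` query parameters. A hedged Python counterpart of the curl command above, which only builds the request without sending it (host and defaults mirror the walkthrough):

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_register_request(host="http://localhost:8081", mar="llm.mar", workers=1):
    """Build (but do not send) the model-registration POST request."""
    query = urlencode({"initial_workers": workers, "url": mar})
    return Request(f"{host}/models?{query}", method="POST")

req = build_register_request()
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the same call as the curl command once TorchServe is running.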
  10. Update the input in prompt.txt if needed and run
curl http://localhost:8080/predictions/llm -T prompt.txt
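The prediction call sends the contents of prompt.txt as the request body to the inference API (port 8080). A minimal Python sketch of the same call (note: `curl -T` performs an upload via PUT, while this sketch uses POST, which the TorchServe inference endpoint also accepts; the sample prompt bytes are a placeholder, not the PR's prompt.txt):

```python
from urllib.request import Request

def build_predict_request(prompt: bytes, host="http://localhost:8080", model="llm"):
    """Build (but do not send) an inference request carrying the prompt as the body."""
    return Request(f"{host}/predictions/{model}", data=prompt, method="POST")

req = build_predict_request(b"Hello")
# urllib.request.urlopen(req) would send it once TorchServe is running
```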

Type of change

Please delete options that are not relevant.

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced. Please also list any relevant details for your test configuration.

  • [x] Test A Logs for Test A

  • [ ] Test B Logs for Test B

Checklist:

  • [x] Did you have fun?
  • [x] Have you added tests that prove your fix is effective or that this feature works?
  • [ ] Has code been commented, particularly in hard-to-understand areas?
  • [ ] Have you made corresponding changes to the documentation?

shrinath-suresh avatar Aug 16 '23 18:08 shrinath-suresh

@mreso Thanks for your review comments. I have already addressed a few of them (implementing the destructor, batch processing, removing auto) based on your previous comments in the babyllama PR. Will address the remaining ones and let you know.

shrinath-suresh avatar Sep 15 '23 09:09 shrinath-suresh

This feature was picked up in v0.10.0 task.

lxning avatar Mar 11 '24 22:03 lxning