Stark Difference in GPU Usage of Triton Servers with Llama3 and Llama3.1 models
Description
I have noticed a huge difference in memory usage for the runtime buffers and the decoder between llama3 and llama3.1.
Triton Information
What version of Triton are you using? 24.07
Are you using the Triton container or did you build it yourself? Built from source
To Reproduce
Steps to reproduce the behavior:
- Build a llama3.1-8B-Instruct TensorRT-LLM engine
- Load llama3.1-8B-Instruct as an ensemble model in tritonserver (see the command sketch below)
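For reference, a minimal sketch of the two steps above, assuming the standard TensorRT-LLM examples/llama scripts and placeholder local paths; exact flags depend on the TensorRT-LLM version paired with Triton 24.07, and the quantisation options (e.g. from examples/quantization) are omitted here since they vary by method:

```bash
# 1. Convert the Hugging Face checkpoint and build the engine
#    (convert_checkpoint.py lives in TensorRT-LLM's examples/llama;
#     all paths below are placeholders)
python convert_checkpoint.py \
    --model_dir ./Meta-Llama-3.1-8B-Instruct \
    --output_dir ./llama3.1-8b-ckpt \
    --dtype float16

trtllm-build \
    --checkpoint_dir ./llama3.1-8b-ckpt \
    --output_dir ./llama3.1-8b-engine \
    --gemm_plugin float16

# 2. Point the tensorrt_llm model in the ensemble repository at the
#    built engine, then start Triton
tritonserver --model-repository=./triton_model_repo
```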
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
I have built an 8-bit quantised llama3 engine; below is the memory allocated to the various components:
Similarly, I have built a 4-bit quantised llama3.1 engine, and the memory allocated for the runtime buffers and the decoder exploded to 3 GB and almost 5 GB respectively, even though the engine itself is smaller.
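To cross-check overall GPU usage of each deployment independently of the per-component figures, a minimal sketch using nvidia-smi (the one-second polling interval is arbitrary):

```bash
# Poll total GPU memory in use once per second while the server loads;
# compare the steady-state value between the llama3 and llama3.1 deployments
nvidia-smi --query-gpu=timestamp,memory.used,memory.total \
           --format=csv -l 1
```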
Expected behavior
The memory allocated for the runtime buffers and the decoder should be similar to that of the llama3 engine.