
Stark Difference in GPU Usage of Triton Servers with Llama3 and Llama3.1 models


Description
I have noticed a large difference in the memory allocated for the runtime buffers and the decoder between the llama3 and llama3.1 engines.

Triton Information
What version of Triton are you using? 24.07

Are you using the Triton container or did you build it yourself? Built from source

To Reproduce
Steps to reproduce the behavior:

  1. Build the llama3.1-8B-Instruct TensorRT-LLM engine
  2. Load llama3.1-8B-Instruct as an ensemble model in tritonserver
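
After step 2, I confirm the ensemble has actually loaded before looking at memory. Below is a minimal sketch of that check using the Triton Python HTTP client; the endpoint `localhost:8000` and the model name `ensemble` are assumptions based on the default tensorrtllm_backend layout, so adjust them to match your repository:

```python
# Minimal readiness check via the Triton HTTP client (pip install tritonclient[http]).
# The endpoint and the model name "ensemble" are assumptions; adjust as needed.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Confirm the server itself is up before checking the model.
assert client.is_server_live() and client.is_server_ready()

# The ensemble reports ready only once all of its composing models have loaded.
print("ensemble ready:", client.is_model_ready("ensemble"))

# The repository index lists every model the server knows about and its state.
for model in client.get_model_repository_index():
    print(model["name"], model.get("state", ""))
```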

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

I have built an 8-bit quantised llama3 engine; below is the memory allocated to the various components: [image: memory allocation for the llama3 engine]

Similarly, I have built a 4-bit quantised llama3.1 engine, and the memory allocated for the runtime buffer and the decoder exploded to 3 GB and almost 5 GB respectively, even though the engine itself is smaller. [image: memory allocation for the llama3.1 engine]
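
To cross-check the overall footprint, total device memory can also be sampled once the server is up. A minimal sketch with pynvml is below (GPU index 0 is assumed; this reports whole-device usage, not the per-component runtime-buffer/decoder breakdown shown in the screenshots):

```python
# Sample total GPU memory usage via NVML (pip install nvidia-ml-py3).
# GPU index 0 is an assumption; adjust for multi-GPU setups.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

info = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"used:  {info.used  / 1024**3:.2f} GiB")
print(f"free:  {info.free  / 1024**3:.2f} GiB")
print(f"total: {info.total / 1024**3:.2f} GiB")

pynvml.nvmlShutdown()
```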

Expected behavior
The memory allocated for the runtime buffer and the decoder should be similar to that of the llama3 engine.

jasonngap1 · Oct 14 '24 01:10