Drop in performance for Llama-2-13b-chat-hf in fp8 when increasing batch size
System Info
- CPU architecture: x86_64
- GPU: NVIDIA H100 80GB
- TensorRT-LLM: v0.8.0 (docker build via make -C docker release_build CUDA_ARCHS="90-real") and 0.9.0.dev2024032600
- Triton Inference Server: r24.02
- OS: Ubuntu 22.04
Who can help?
@kaiyux
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I've followed the official documentation to create Llama models and run them with Triton. I'm testing fp8 and int8 quantization. https://github.com/NVIDIA/TensorRT-LLM/tree/v0.8.0/examples/llama
For the fp8 model, I used the following commands:
python ../quantization/quantize.py --model_dir meta-llama/Llama-2-13b-chat-hf \
--dtype float16 \
--qformat fp8 \
--output_dir /models/quant/Llama-2-13b-chat-hf_fp8 \
--calib_size 512 \
--tp_size 1
trtllm-build --checkpoint_dir /models/quant/Llama-2-13b-chat-hf_fp8 \
--output_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp8 \
--gemm_plugin float16 \
--workers 1 \
--use_custom_all_reduce disable \
--remove_input_padding enable \
--use_paged_context_fmha enable \
--strongly_typed \
--max_batch_size 256
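As a quick sanity check on what actually got baked into the engine, the configured max_batch_size can be read back from the config.json that trtllm-build writes to the output directory (the exact key layout varies between releases, so a plain grep is enough here):

grep -o '"max_batch_size": *[0-9]*' /models/engines/Llama-2-13b-chat-hf_1gpu_fp8/config.json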
For the int8 model:
python3 convert_checkpoint.py --model_dir meta-llama/Llama-2-13b-chat-hf \
--output_dir /models/rt/Llama-2-13b-chat-hf_1gpu_fp16_wq8 \
--dtype float16 \
--use_weight_only \
--weight_only_precision int8 \
--tp_size 1 \
--workers 1
trtllm-build --checkpoint_dir /models/rt/Llama-2-13b-chat-hf_1gpu_fp16_wq8 \
--output_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp16_wq8_pc \
--gemm_plugin float16 \
--workers 1 \
--use_custom_all_reduce disable \
--remove_input_padding enable \
--use_paged_context_fmha enable \
--max_batch_size 256
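Before involving Triton at all, either engine can be smoke-tested directly with the run.py script from the examples folder. This is a sketch using the int8 engine path from above; the exact flags may differ slightly between releases:

python3 ../run.py --engine_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp16_wq8_pc \
                  --tokenizer_dir meta-llama/Llama-2-13b-chat-hf \
                  --max_output_len 64 \
                  --input_text "What is the capital of France?"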
I run the models with the Triton Docker container:
docker run -d --rm --net host --shm-size=40g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all --name trt-llm \
-v /root/dev/tensorrtllm_backend:/tensorrtllm_backend \
-v /root/dev/models:/models \
-v /root/models:/models-hub \
trt-24-dev \
mpirun --allow-run-as-root -n 1 /opt/tritonserver/bin/tritonserver --log-info True --log-verbose 3 --model-repository=/models/triton/llama-fp8 --grpc-port=8001 --http-port=8000 --metrics-port=8002 --disable-auto-complete-config --backend-config=python,shm-region-prefix-name=prefix0_
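Before benchmarking, readiness can be verified through Triton's standard health endpoint on the HTTP port configured above:

curl -sf localhost:8000/v2/health/ready && echo "server is ready"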
I'm testing performance for different setups, and I ran into the following issue:
When setting max_batch_size to high values (like 256) and running only 1 request at the same time, the performance of an fp8 model drops significantly compared to the int8 model and compared to a model built with max_batch_size=1.
I mainly use Locust for my tests, but to rule out a problem in my own code, I also ran the benchmark_core_model.py script. I had to modify it to match my approach, forcing it to send only one request at a time. My changes to the test_performance function:
    responses = []  # <-------- HERE
    for i, ids in enumerate(input_start_ids):
        output0_len = np.ones_like([[1]]).astype(np.int32) * output_lens[i]
        end_id = np.full_like([[1]], 2).astype(np.int32)
        inputs = [
            utils.prepare_tensor("input_ids", ids, FLAGS.protocol),
            utils.prepare_tensor("input_lengths", input_lens[i],
                                 FLAGS.protocol),
            utils.prepare_tensor("request_output_len", output0_len,
                                 FLAGS.protocol),
            utils.prepare_tensor("end_id", end_id, FLAGS.protocol),
        ]
        # time.sleep(delays[i])  <-------- HERE
        if FLAGS.protocol == "http":
            async_requests.append(
                client.async_infer(model_name, inputs, request_id=str(i)))
        elif FLAGS.protocol == "grpc":
            async_requests.append(
                client.async_infer(model_name,
                                   inputs,
                                   callback=partial(callback, user_data,
                                                    datetime.now(), i),
                                   request_id=str(i)))
        # Wait for the current request to finish before sending the next one.
        responses.append(utils.get_grpc_results(user_data, 1)[0])  # <-------- HERE
    try:
        # ------- HERE ->
        # if FLAGS.protocol == "http":
        #     utils.get_http_results(async_requests)
        # elif FLAGS.protocol == "grpc":
        #     responses = utils.get_grpc_results(user_data, len(input_start_ids))
        # else:
        #     raise RuntimeError("Invalid protocol")
        # <------- HERE
Next, I run tests with the following command (with my own dataset):
python3 benchmark_core_model.py -i grpc --max-input-len 1024 --num-requests 100 --request-rate -1 --time-delay-dist constant dataset --dataset /data/questions_triton.json --tokenizer-dir meta-llama/Llama-2-13b-chat-hf --op-tokens-per-word 1.0
and with the synthetic dataset:
python3 benchmark_core_model.py -i grpc --max-input-len 1024 --num-requests 50 --request-rate -1 token-norm-dist --input-mean 128 --input-stdev 5 --output-mean 500 --output-stdev 20
Here are the results for fp8 and my dataset:
+----------------------------+-----------+
| Stat | Value |
+----------------------------+-----------+
| Requests/Sec | 0.25 |
| OP tokens/sec | 67.04 |
| Avg. latency (ms) | 4025.84 |
| P99 latency (ms) | 14156.30 |
| P90 latency (ms) | 8077.87 |
| Avg. IP tokens per request | 16.76 |
| Avg. OP tokens per request | 269.90 |
| Avg. InFlight requests | 0.00 |
| Total latency (ms) | 201303.43 |
| Total requests | 50.00 |
+----------------------------+-----------+
for int8:
+----------------------------+-----------+
| Stat | Value |
+----------------------------+-----------+
| Requests/Sec | 0.38 |
| OP tokens/sec | 96.53 |
| Avg. latency (ms) | 2640.49 |
| P99 latency (ms) | 9806.77 |
| P90 latency (ms) | 5151.49 |
| Avg. IP tokens per request | 16.76 |
| Avg. OP tokens per request | 254.92 |
| Avg. InFlight requests | 0.00 |
| Total latency (ms) | 132036.59 |
| Total requests | 50.00 |
+----------------------------+-----------+
And for the synthetic dataset, fp8:
+----------------------------+-----------+
| Stat | Value |
+----------------------------+-----------+
| Requests/Sec | 0.14 |
| OP tokens/sec | 53.00 |
| Avg. latency (ms) | 7088.72 |
| P99 latency (ms) | 7553.55 |
| P90 latency (ms) | 7429.67 |
| Avg. IP tokens per request | 127.72 |
| Avg. OP tokens per request | 375.72 |
| Avg. InFlight requests | 0.00 |
| Total latency (ms) | 354447.98 |
| Total requests | 50.00 |
+----------------------------+-----------+
for int8:
+----------------------------+-----------+
| Stat | Value |
+----------------------------+-----------+
| Requests/Sec | 0.20 |
| OP tokens/sec | 75.67 |
| Avg. latency (ms) | 4945.65 |
| P99 latency (ms) | 5381.79 |
| P90 latency (ms) | 5182.60 |
| Avg. IP tokens per request | 128.62 |
| Avg. OP tokens per request | 374.24 |
| Avg. InFlight requests | 0.00 |
| Total latency (ms) | 247294.83 |
| Total requests | 50.00 |
+----------------------------+-----------+
As you can see, there is a huge difference in OP tokens/sec between fp8 and int8. It made me suspicious because I expected the performance to be similar, so I started to look for the cause. After some time, I found out that the problem disappears when I build models with max_batch_size=1.
With max_batch_size=1, the results for fp8 and int8 are nearly the same: 103.079041 OP tokens/sec for int8 and 100.961634 for fp8.
I investigated further and found that the models perform similarly up to around max_batch_size=64. Increasing it further causes fp8 performance to drop gradually.
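Rebuilding the fp8 engine for different max_batch_size values can be scripted roughly like this (a sketch reusing the paths and flags of the fp8 build above):

for bs in 1 8 32 64 128 256; do
  trtllm-build --checkpoint_dir /models/quant/Llama-2-13b-chat-hf_fp8 \
               --output_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp8_bs${bs} \
               --gemm_plugin float16 \
               --workers 1 \
               --use_custom_all_reduce disable \
               --remove_input_padding enable \
               --use_paged_context_fmha enable \
               --strongly_typed \
               --max_batch_size ${bs}
done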
I wonder whether this issue also affects performance when there is more than one simultaneous request, but I have no way to verify that. For example, I tested with 20 simultaneous requests (in Locust), and the performance is similar for both models: 65.904632 tokens/s for fp8 and 63.529027 for int8. But I don't know whether they should be the same or whether fp8 should be faster at that point.
I tested it on v0.8.0 and 0.9.0.dev2024032600.
I can provide more details or results if needed.
Looking forward to solving this issue together.
Expected behavior
Performance for fp8 and int8 models is comparable.
Actual behavior
Performance of the fp8 model drops significantly when max_batch_size is increased and only one request is processed at a time.
Additional notes
Hi @bprus, can you try again on the latest main branch? We've integrated several optimizations, including multiple profiles, which should minimize the impact of max_batch_size on kernel selection.
Also, may I ask why you set --gemm_plugin float16 for the fp8 case? Enabling the GEMM plugin is usually not recommended for fp8. Could you also try to reproduce the perf issue with gptManagerBenchmark and let us know? That would rule out any impact from the Triton backend.
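For reference, a minimal gptManagerBenchmark invocation might look roughly like the following; the flag names are from memory of the benchmarks/cpp README of that era and may differ between versions (additional flags such as a model name may be required), so please check --help. The dataset json here is a placeholder assumed to be in the format produced by benchmarks/cpp/prepare_dataset.py:

./benchmarks/gptManagerBenchmark --engine_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp8 \
                                 --type IFB \
                                 --dataset /data/preprocessed_dataset.json \
                                 --max_num_samples 50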
Thanks for your support and help.
Hi @kaiyux, thanks for the reply. I'll check it in 2 weeks, because I'm out of office right now. I'll get back with the results as soon as I can.
@kaiyux
I tested the new version: 0.11.0.dev2024060400
When using multiple profiles the issue disappears. Thanks for the help!
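For anyone hitting the same problem, multiple profiles are enabled at build time. A sketch of the fp8 build with the option turned on (same paths as above; --gemm_plugin is omitted per the advice below, and the remaining flags may need adjusting for newer releases):

trtllm-build --checkpoint_dir /models/quant/Llama-2-13b-chat-hf_fp8 \
             --output_dir /models/engines/Llama-2-13b-chat-hf_1gpu_fp8 \
             --multiple_profiles enable \
             --remove_input_padding enable \
             --use_paged_context_fmha enable \
             --max_batch_size 256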
I used --gemm_plugin float16 because it was in the official example:
trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_fp8 \
--output_dir ./engine_outputs \
--gemm_plugin float16 \
--strongly_typed \
--workers 2
(now it's changed to --gemm_plugin auto)
Thank you for letting me know that this is not a good practice.
However, when I turned off the gemm plugin, I got the following warnings:
[06/05/2024-10:13:15] [TRT-LLM] [I] Set dtype to float16.
[06/05/2024-10:13:15] [TRT-LLM] [W] Parameter dtype is None, using default dtype: DataType.FLOAT, it is recommended to always specify dtype explicitly
...
I can't find any way to set dtype. Can you suggest something?
Also, when testing the current version, I stumbled on another issue: https://github.com/NVIDIA/TensorRT-LLM/issues/1738
Hi @bprus, do you still have any further issues or questions? If not, we'll close this soon.
We can close, thanks.