
Mixtral 8x7B-v0.1 hangs after serving a few requests


System Info

2x NVIDIA A100 80GB (160 GB total)

Who can help?

@byshiue @kaiyux

Information

  • [X] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Build from source by cloning the main branch of tensorrtllm_backend

# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

Download weights from HF

pip install -r requirements.txt # install latest version of transformers, needed for Mixtral

git lfs install
git clone https://huggingface.co/mistralai/Mixtral-8x7B-v0.1

Set the model paths and generate the engines

export HF_LLAMA_MODEL=/path/Mixtral-8x7B-v0.1
export UNIFIED_CKPT_PATH=/path/mixtral-56B/
export ENGINE_PATH=/path/tensorrtllm_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/

python3 ./examples/llama/convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
                             --output_dir ${UNIFIED_CKPT_PATH} \
                             --dtype float16 \
                             --tp_size 2 \
                             --use_weight_only \
                             --weight_only_precision int4

python3 -m tensorrt_llm.commands.build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
                 --output_dir ${ENGINE_PATH} \
                 --gemm_plugin float16 \
                 --max_input_len 32000

Then start the Triton server:

cp all_models/inflight_batcher_llm/ mixtral_ifb -r

python3 tools/fill_template.py -i mixtral_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False
python3 tools/fill_template.py -i mixtral_ifb/ensemble/config.pbtxt triton_max_batch_size:64
python3 tools/fill_template.py -i mixtral_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:20000,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0

python3 scripts/launch_triton_server.py --world_size=2 --model_repo=mixtral_ifb/ --log

Finally, in a separate terminal:

sudo docker run --gpus all --rm -it --net host -v /home/azureuser/:/home/azureuser/ nvcr.io/nvidia/tritonserver:24.03-py3-sdk

perf_analyzer -m ensemble --input-data llm_inputs.json --measurement-interval 45000 --service-kind triton --request-rate-range 0.5:1.5:0.5 --request-distribution constant --stability-percentage 1000 -i grpc -u localhost:8001 --shape max_tokens:1 --shape text_input:1 -f output2k_large.csv --verbose-csv --collect-metrics
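
The contents of llm_inputs.json are not shown here; a hypothetical minimal input-data file for this ensemble, assuming the standard text_input and max_tokens tensors and an illustrative prompt, could look like:

{
  "data": [
    {
      "text_input": ["What is machine learning?"],
      "max_tokens": [128]
    }
  ]
}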

Expected behavior

perf_analyzer should return throughput and latency numbers.

Actual behavior

The command just hangs and doesn't return anything:

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 45000 msec
  Latency limit: 0 msec
  Request Rate limit: 1.5 requests per seconds
  Using uniform distribution on request generation
  Using synchronous calls for inference
  Stabilizing using average latency

Request Rate: 0.5 inference requests per seconds
failed to find the requested model version

Additional notes

I wrote a custom script that uses gRPC via tritonclient to send synchronous requests. Initially each request completes in about 8 seconds, but after roughly 40 such requests it just hangs.
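
For reference, a minimal sketch of such a synchronous client, assuming the standard ensemble tensor names (text_input, max_tokens, text_output) from the inflight_batcher_llm model repository and an illustrative prompt:

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the Triton gRPC endpoint started above.
client = grpcclient.InferenceServerClient(url="localhost:8001")

# text_input is a string tensor and max_tokens an int32 tensor (shapes assumed [1, 1]).
prompt = "What is machine learning?"
text_input = grpcclient.InferInput("text_input", [1, 1], "BYTES")
text_input.set_data_from_numpy(np.array([[prompt]], dtype=object))

max_tokens = grpcclient.InferInput("max_tokens", [1, 1], "INT32")
max_tokens.set_data_from_numpy(np.array([[128]], dtype=np.int32))

# Synchronous request against the ensemble model; this is the call that
# eventually stops returning after ~40 requests.
result = client.infer(
    model_name="ensemble",
    inputs=[text_input, max_tokens],
    outputs=[grpcclient.InferRequestedOutput("text_output")],
)
print(result.as_numpy("text_output"))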

The tritonserver logs with verbose logging enabled look like this:

I0514 21:01:55.276668 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110579,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.284479 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110580,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.292292 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110581,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.299936 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110582,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.307794 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110583,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.315951 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110584,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.323544 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110585,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.331394 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110586,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}
I0514 21:01:55.339250 160 model_instance_state.cc:813] {"Active Request Count":1,"Iteration Counter":110587,"Max Request Count":1,"Runtime CPU Memory Usage":404,"Runtime GPU Memory Usage":6496902765,"Runtime Pinned Memory Usage":1074138393,"Timestamp":"05-14-2024 21:01:55","Context Requests":0,"Generation Requests":1,"MicroBatch ID":0,"Paused Requests":0,"Scheduled Requests":1,"Total Context Tokens":0,"Free KV cache blocks":137,"Max KV cache blocks":157,"Tokens per KV cache block":128,"Used KV cache blocks":20}

The server never returns a response and just hangs.

Quantizing to int4 doesn't help either.

aaditya-srivathsan, May 15 '24

@aaditya-srivathsan We are reviewing this ticket and will get back to you with updates.

ganeshku1, May 20 '24

@ganeshku1 any update on this?

aaditya-srivathsan, May 31 '24

@aaditya-srivathsan We are working on resolving this issue. Will update this thread once this issue is resolved.

cc: @dyastremsky

ganeshku1, May 31 '24

Hi @aaditya-srivathsan, I've seen some similar issues reported that were solved by setting --use_custom_all_reduce disable.

Can you try this to see if it helps?
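
For reference, applying that flag means rebuilding the engine with it added to the build command from the reproduction steps (assuming --use_custom_all_reduce is a build-time option in this TensorRT-LLM version) and then restarting the Triton server:

python3 -m tensorrt_llm.commands.build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
                 --output_dir ${ENGINE_PATH} \
                 --gemm_plugin float16 \
                 --max_input_len 32000 \
                 --use_custom_all_reduce disable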

rmccorm4, Jun 10 '24

Sure, let me try this and I'll let you know whether it works.

aaditya-srivathsan, Jun 13 '24

This did help, thank you very much!

aaditya-srivathsan, Jun 25 '24