
Mixtral with TP hangs indefinitely if another process uses the same GPU with ONNX

Open v-dicicco opened this issue 1 year ago • 3 comments

System Info

  • CPU architecture: x86_64
  • GPU name: NVIDIA A40, 46GB
  • TensorRT-LLM: v0.9.0
  • OS: Ubuntu 20.04
  • NVIDIA driver: 535.54.03, CUDA: 12.2

Who can help?

@kaiyux @byshiue

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

If a process is using TensorRT-LLM to continuously run inference on Mixtral 8x7B with tensor parallelism (tested with TP=2, 4 and 8), then as soon as another process uses the same GPU to run inference with ONNX Runtime (with the CUDA or TensorRT execution provider), the TensorRT-LLM inference hangs. The same behavior occurs when inference is served through Triton using tensorrtllm_backend.

Here are the steps to reproduce using code in the repo:

  1. Check out the v0.9.0 tag of TensorRT-LLM.
  2. Modify the examples/run.py script to do inference continuously. Here is a gist with the modified script ready to download (the patch would be a bit long); it simply wraps the warmup/generate part of the --run_profiling code in an infinite loop to simulate a process that continuously does inference.
  3. Convert and build Mixtral 8x7B using TP=2 (I used int8 weight-only quantization; a rough sketch of the convert/build commands follows the run command below), then run inference using --run_profiling to simulate continuous load:
mpirun -n 2 python3 run.py --max_output_len 100 --tokenizer_dir <path_to_tokenizer> --engine_dir <path_to_trt_mixtral>  --run_profiling
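For reference, the convert/build part of step 3 looked roughly like this. Paths are placeholders and the exact flags may differ depending on your checkout; treat this as a sketch of an int8 weight-only, TP=2 build rather than the exact commands:
### Convert the HF checkpoint to a TensorRT-LLM checkpoint (TP=2, int8 weight-only)
python3 examples/llama/convert_checkpoint.py --model_dir <path_to_hf_mixtral> \
    --output_dir ./tllm_ckpt_mixtral_tp2 \
    --dtype float16 \
    --tp_size 2 \
    --use_weight_only \
    --weight_only_precision int8

### Build the engines
trtllm-build --checkpoint_dir ./tllm_ckpt_mixtral_tp2 \
    --output_dir <path_to_trt_mixtral> \
    --gemm_plugin float16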
  4. The script will keep iterating the profiling loop. Please ignore the actual numbers; this is only to simulate continuous inference:
batch_size: 1, avg latency of 1 iterations: : 1.442868947982788 sec
batch_size: 1, avg latency of 1 iterations: : 1.4428672790527344 sec
batch_size: 1, avg latency of 1 iterations: : 2.886024236679077 sec
batch_size: 1, avg latency of 1 iterations: : 2.8860158920288086 sec
batch_size: 1, avg latency of 1 iterations: : 4.32866644859314 sec
batch_size: 1, avg latency of 1 iterations: : 4.328623056411743 sec
batch_size: 1, avg latency of 1 iterations: : 5.7726891040802 sec
batch_size: 1, avg latency of 1 iterations: : 5.772603273391724 sec
batch_size: 1, avg latency of 1 iterations: : 7.215593338012695 sec
batch_size: 1, avg latency of 1 iterations: : 7.215665578842163 sec
batch_size: 1, avg latency of 1 iterations: : 8.658446550369263 sec
batch_size: 1, avg latency of 1 iterations: : 8.6583890914917 sec
  5. Run another, external script that uses ONNX Runtime on the same GPU. Here is a quick setup:
### setup env:
python3.10 -m venv venv
source venv/bin/activate
pip install transformers[onnx] torch
pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/

### Download a model & convert to ONNX
huggingface-cli download distilbert/distilbert-base-uncased --local-dir model
python -m transformers.onnx -m ./model onnx
  6. Create the following inference.py:
from transformers import AutoTokenizer
from onnxruntime import InferenceSession, SessionOptions

import torch

providers = [("CUDAExecutionProvider", {"device_id": torch.cuda.current_device()})]

sess_options = SessionOptions()
session = InferenceSession("onnx/model.onnx", sess_options=sess_options, providers=providers)

tokenizer = AutoTokenizer.from_pretrained("model")
inputs = tokenizer("Using DistilBERT with ONNX Runtime!", return_tensors="np")

while True:
    outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
    print(outputs[0].shape)
  7. Run python3 inference.py, making sure the script is actually using one of the GPUs that TensorRT-LLM is using to run Mixtral.
  8. As soon as the ONNX script actually starts running inference, the TensorRT-LLM process will hang (you will see it stop printing). Even if you stop the ONNX process, TensorRT-LLM will not recover.

Expected behavior

Inference with TensorRT-LLM should not hang when other processes use the same GPU.

Actual behavior

Inference with TensorRT-LLM hangs after the ONNX process starts its inference.

Additional notes

  • when the hang happens, GPU utilization is stuck at 100% (but power draw is low)
  • the bug happens even if inference is done through Triton (with tensorrtllm_backend); here I'm using run.py just to (hopefully) help you reproduce the issue
  • the bug doesn't happen if Mixtral is not using TP (e.g. with an A100 80GB + int8)
  • I've tried to debug at a slightly lower level, and looking at the TensorRT-LLM logs (with a different setup), the code seems stuck in this synchronize
  • I've hypothesized that both ONNX Runtime and TensorRT-LLM could be using the same CUDA stream and therefore interfere with each other (not sure if this makes sense). I've tried to force ONNX Runtime to use a different one (see the sketch below)...without luck.
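For completeness, this is roughly how I tried to hand ONNX Runtime its own stream. Note that using user_compute_stream as a CUDAExecutionProvider option is an assumption based on the onnxruntime-gpu documentation, and the option names/behavior may vary between versions:
import torch
from onnxruntime import InferenceSession

# Create a dedicated (non-default) CUDA stream and pass its raw pointer to ORT,
# so the ONNX session should not share the legacy default stream.
stream = torch.cuda.Stream()
providers = [(
    "CUDAExecutionProvider",
    {
        "device_id": torch.cuda.current_device(),
        # assumption: the option expects the raw stream pointer as a string
        "user_compute_stream": str(stream.cuda_stream),
    },
)]
session = InferenceSession("onnx/model.onnx", providers=providers)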

v-dicicco avatar May 14 '24 16:05 v-dicicco

Could you try disabling use_custom_all_reduce when building the engine?
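For example (assuming the v0.9 trtllm-build CLI; checkpoint and engine paths are placeholders):
trtllm-build --checkpoint_dir ./tllm_ckpt_mixtral_tp2 \
    --output_dir <path_to_trt_mixtral> \
    --gemm_plugin float16 \
    --use_custom_all_reduce disable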

byshiue avatar May 15 '24 06:05 byshiue

Disabling use_custom_all_reduce solved the issue, many thanks!

Does this mean there is a bug in the custom all-reduce plugin, and if so, do you think it will be fixed? Furthermore, the description of the flag says that, when enabled, it should help reduce latency on NVLink setups (my scenario), but there isn't any benchmark... I'm trying to benchmark it myself, but it would be really helpful to have a rough idea of the expected impact when it is disabled. Are there additional details available somewhere?

v-dicicco avatar May 15 '24 14:05 v-dicicco

@v-dicicco we are still not quite sure whether the bug is inside the kernel or whether we have to set something up to get it right.

but there isn't any benchmark...

we have the all_reduce tests here.

PerkzZheng avatar May 17 '24 07:05 PerkzZheng

Hi @v-dicicco, do you still have any further issues or questions? If not, we'll close this soon.

nv-guomingz avatar Nov 14 '24 06:11 nv-guomingz

Hi @nv-guomingz, disabling use_custom_all_reduce solved the reported issue with TRT-LLM 0.11.0, so this specific issue can be closed.

As a last question: I saw that in recent versions of TRT-LLM (>= 0.12.0) it is no longer possible to disable use_custom_all_reduce; does that mean it is now active by default? I haven't tested whether updating reintroduces the same problem, but I plan to run a test.

v-dicicco avatar Nov 18 '24 15:11 v-dicicco

use_custom_all_reduce

Hi @v-dicicco, you're right, the use_custom_all_reduce knob has been removed from the trtllm-build command since the 0.12 release. Now TensorRT-LLM selects between the native NCCL all-reduce and the custom all-reduce plugin automatically via an internal strategy.

Please feel free to reopen this ticket if needed.

nv-guomingz avatar Nov 18 '24 16:11 nv-guomingz