
Bus error running t5 conversion script using the latest main

Open · sc-gr opened this issue May 03 '24

System Info

GPU: NVIDIA A10G. I have tried both an AWS g5.2xlarge and an AWS g5.12xlarge instance.

Who can help?

@byshiue

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

I pretty much followed the official installation steps:

  1. docker run --shm-size=2g --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04
  2. apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git python-is-python3 vim
  3. pip3 install tensorrt_llm -U --pre --extra-index-url https://pypi.nvidia.com
  4. git clone https://github.com/NVIDIA/TensorRT-LLM.git (05/02 version)
  5. cd TensorRT-LLM
export MODEL_TYPE="t5"
export MODEL_NAME="google/flan-t5-large"
export INFERENCE_PRECISION="float32"
export TP_SIZE=1
export PP_SIZE=1
export WORLD_SIZE=1

python examples/enc_dec/convert_checkpoint.py --model_type ${MODEL_TYPE} \
    --model_dir ${MODEL_NAME} \
    --output_dir tmp/trt_models/${MODEL_NAME}/${INFERENCE_PRECISION} \
    --tp_size ${TP_SIZE} \
    --pp_size ${PP_SIZE} \
    --weight_data_type float32 \
    --dtype ${INFERENCE_PRECISION}
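
Since the wheel in step 3 comes from pip while the examples in step 4 come from a fresh clone, a quick sanity check that the two match can rule out version skew (a side check added here under that assumption; the thread does not confirm skew as the cause):

    python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"   # version of the installed pip wheel
    git -C TensorRT-LLM log -1 --format=%cd                             # date of the cloned commit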

Expected behavior

Model converted

Actual behavior

[TensorRT-LLM] TensorRT-LLM version: 0.10.0.dev2024043000
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
/usr/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 2 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Bus error (core dumped)
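
For context: a "Bus error (core dumped)" in a Python multiprocessing worker is typically a SIGBUS raised when shared memory runs out. A quick way to inspect the cap from inside the container:

    df -h /dev/shm    # size of the shared-memory mount; capped at 2.0G here by --shm-size=2g
    ipcs -lm          # kernel shared-memory limits

If the conversion's peak shared-memory use exceeds the /dev/shm size, writes to the mapped pages fail with SIGBUS.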

Additional notes

I also tried a BART model with the same script, and it exits successfully; just change to export MODEL_TYPE="bart" and export MODEL_NAME="facebook/bart-large-cnn". So this might be a problem specific to the t5 architecture, or it could be related to the GPU type I'm using (A10G).

sc-gr commented May 03 '24

Also running into the same issue, but on a single A100 80 GB GPU.

aravindMahadevan commented May 06 '24

Having the same issue on a single A100 80 GB GPU, converting a t5.

TeamSeshDeadBoy commented May 08 '24

Investigating.

symphonylyh commented May 08 '24

Hi @sc-gr @aravindMahadevan @TeamSeshDeadBoy,

I am able to reproduce the error with the reproduction steps. In short, the reason is that the memory available to Python multiprocessing is limited by the docker run settings. I've tested this myself on the same A100 node: launching the container with the command below (the rest of the steps can stay the same) solves the problem:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --entrypoint /bin/bash --runtime=nvidia -it nvidia/cuda:12.1.0-devel-ubuntu22.04
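
The effect of --ipc=host on shared memory can be seen directly by comparing the two launch modes with the same image (exact sizes depend on the host):

    docker run --rm --shm-size=2g nvidia/cuda:12.1.0-devel-ubuntu22.04 df -h /dev/shm
        # /dev/shm is capped at 2.0G by --shm-size
    docker run --rm --ipc=host nvidia/cuda:12.1.0-devel-ubuntu22.04 df -h /dev/shm
        # container shares the host's /dev/shm (typically half of host RAM), so no container-side cap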

The official installation guide is the faster way to install, but it may not work well with larger models (in this case, surfacing as a Python multiprocessing error). The official build from source guide can be more reliable, as it builds directly from the latest cloned repository instead of from pip packages.

Thanks!

jhaotingc commented May 08 '24

It works after adding --ulimit memlock=-1 --ulimit stack=67108864, thanks!

sc-gr commented May 09 '24

Update: the bus error is triggered by not passing the --ipc=host argument. The following command does not trigger the bus error:

docker run -it --gpus=all --ipc=host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --runtime=nvidia [DOCKER_IMAGE] bash
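
To verify the limits took effect, run these checks inside a container started with that command:

    ulimit -l         # prints "unlimited" with --ulimit memlock=-1
    ulimit -s         # prints 65536 (KB) with --ulimit stack=67108864 (bytes)
    df -h /dev/shm    # host-sized when --ipc=host is set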

jhaotingc commented Jun 06 '24