text-generation-inference
Stuck when running text-generation-benchmark on an AMD GPU
System Info
Target: x86_64-unknown-linux-gnu
Cargo version: 1.78.0
Commit sha: 96b7b40ca3e39f7ca5b875bff9a4665c1b175289
Docker label: sha-96b7b40-rocm
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
I followed the steps from https://github.com/huggingface/hf-rocm-benchmark:
- Start the Docker container; a local model is used and the server starts up successfully:
```shell
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 256g \
  --net host -v $(pwd)/hf_cache:/data -e HUGGING_FACE_HUB_TOKEN=$HF_READ_TOKEN \
  ghcr.io/huggingface/text-generation-inference:sha-293b8125-rocm \
  --model-id local_path/Meta-Llama-70B-Instruct --num-shard 8
```
- Open another shell and attach to the running container:
```shell
docker exec -it tgi_container_name /bin/bash
```
- Run the benchmark:
```shell
text-generation-benchmark --tokenizer-name meta-llama/Meta-Llama-3-70B-Instruct \
  --sequence-length 2048 --decode-length 128 --warmups 2 --runs 10 \
  -b 1 -b 2
```
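Before running the benchmark, the router itself can be sanity-checked with a single request. A minimal probe, assuming the default port 80 (reachable directly because of `--net host`):

```shell
# Send one short generation request to the TGI router; a JSON response
# containing "generated_text" confirms the server is actually serving.
curl 127.0.0.1:80/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```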
The benchmark then got stuck after the following log:
```
2024-06-17T11:01:59.291750Z INFO text_generation_benchmark: benchmark/src/main.rs:138: Loading tokenizer
2024-06-17T11:01:59.291802Z INFO text_generation_benchmark: benchmark/src/main.rs:144: Found local tokenizer
2024-06-17T11:01:59.336401Z INFO text_generation_benchmark: benchmark/src/main.rs:161: Tokenizer loaded
2024-06-17T11:01:59.365280Z INFO text_generation_benchmark: benchmark/src/main.rs:170: Connect to model server
2024-06-17T11:01:59.368575Z INFO text_generation_benchmark: benchmark/src/main.rs:179: Connected
```
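When it hangs at this point, a stack dump from inside the container can show where the server or the benchmark is blocked; `--cap-add=SYS_PTRACE` in the `docker run` above already permits ptrace. A sketch, assuming `py-spy` is installed in the container (`pip install py-spy`):

```shell
# Locate the Python shard process, then dump its current stack traces;
# this reveals whether it is stuck in a GPU kernel, a collective, or idle.
ps aux | grep text-generation-server
py-spy dump --pid <server_pid>
```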
I also tried Llama-2-7B on a single GPU with a sequence length of 512 and a decode length of 128, but it got stuck as well:
```
2024-06-17T10:54:34.661975Z INFO text_generation_launcher: Convert: [1/2] -- Took: 0:00:23.355863
2024-06-17T10:54:42.624075Z INFO text_generation_launcher: Convert: [2/2] -- Took: 0:00:07.961668
2024-06-17T10:54:43.550339Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-06-17T10:54:43.550676Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-06-17T10:54:46.861699Z INFO text_generation_launcher: Detected system rocm
2024-06-17T10:54:46.929654Z INFO text_generation_launcher: ROCm: using Flash Attention 2 Composable Kernel implementation.
2024-06-17T10:54:47.181972Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'
2024-06-17T10:54:53.564579Z INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-17T10:54:58.632695Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-06-17T10:54:58.670817Z INFO shard-manager: text_generation_launcher: Shard ready in 15.119042733s rank=0
2024-06-17T10:54:58.766242Z INFO text_generation_launcher: Starting Webserver
2024-06-17T10:54:58.849177Z INFO text_generation_router: router/src/main.rs:302: Using config Some(Llama)
2024-06-17T10:54:58.849209Z WARN text_generation_router: router/src/main.rs:311: no pipeline tag found for model /home/zhuh/7b-chat-hf
2024-06-17T10:54:58.849213Z WARN text_generation_router: router/src/main.rs:329: Invalid hostname, defaulting to 0.0.0.0
2024-06-17T10:54:58.853566Z INFO text_generation_router::server: router/src/server.rs:1552: Warming up model
2024-06-17T10:54:59.601144Z INFO text_generation_launcher: PyTorch TunableOp (https://github.com/fxmarty/pytorch/tree/2.3-patched/aten/src/ATen/cuda/tunable) is enabled. The warmup may take several minutes, picking the ROCm optimal matrix multiplication kernel for the target lengths 1, 2, 4, 8, 16, 32, with typical 5-8% latency improvement for small sequence lengths. The picked GEMMs are saved in the file /data/tunableop_-home-zhuh-7b-chat-hf_tp1_rank0.csv. To disable TunableOp, please launch TGI with `PYTORCH_TUNABLEOP_ENABLED=0`.
2024-06-17T10:54:59.601247Z INFO text_generation_launcher: Warming up TunableOp for seqlen=1
2024-06-17T10:55:46.295162Z INFO text_generation_launcher: Warming up TunableOp for seqlen=2
2024-06-17T10:56:18.910991Z INFO text_generation_launcher: Warming up TunableOp for seqlen=4
2024-06-17T10:56:51.715308Z INFO text_generation_launcher: Warming up TunableOp for seqlen=8
2024-06-17T10:57:24.784412Z INFO text_generation_launcher: Warming up TunableOp for seqlen=16
2024-06-17T10:57:59.430531Z INFO text_generation_launcher: Warming up TunableOp for seqlen=32
2024-06-17T10:58:29.335915Z INFO text_generation_launcher: Cuda Graphs are enabled for sizes [1, 2, 4, 8, 16, 32]
2024-06-17T10:58:30.344828Z INFO text_generation_router::server: router/src/server.rs:1579: Using scheduler V3
2024-06-17T10:58:30.344853Z INFO text_generation_router::server: router/src/server.rs:1631: Setting max batch total tokens to 346576
2024-06-17T10:58:30.360395Z INFO text_generation_router::server: router/src/server.rs:1868: Connected
```
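To rule out the ROCm-specific warmup machinery, the server can be relaunched with TunableOp disabled, as the log message itself suggests; disabling graph capture as well narrows things further (the `--cuda-graphs 0` launcher flag is assumed from TGI's CLI help):

```shell
# Same launch as above, but with TunableOp GEMM tuning disabled
# (env var taken from the warmup log) and graph capture turned off.
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 256g \
  --net host -v $(pwd)/hf_cache:/data -e PYTORCH_TUNABLEOP_ENABLED=0 \
  ghcr.io/huggingface/text-generation-inference:sha-293b8125-rocm \
  --model-id local_path/Meta-Llama-70B-Instruct --num-shard 8 --cuda-graphs 0
```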
Expected behavior
Prefill and decode latency figures are the expected output, but the benchmark gets stuck and outputs nothing for nearly an hour. Moreover, GPU utilization stays at zero during the hang, whereas it was non-zero during the warmup steps.
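A simple way to watch GPU utilization while the benchmark hangs, assuming `rocm-smi` is available inside the ROCm container:

```shell
# Refresh GPU utilization and VRAM usage every second; during the hang
# utilization stays at zero, while it was non-zero during warmup.
watch -n 1 rocm-smi
```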