
Webserver crashing with GPTQ model `Server error: transport error Error: Warmup(Generation("transport error"))`

Open itsmnjn opened this issue 1 year ago • 4 comments

System Info

Lambda Labs H100 instance, Ubuntu, running text-generation-inference:latest with Docker

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [X] My own modifications

Reproduction

Variables:

model=TheBloke/WizardLM-33B-V1.0-Uncensored-GPTQ
num_shard=1
volume=/home/ubuntu/data # share a volume with the Docker container to avoid downloading weights every run

Docker run command:

sudo docker run -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 -d --log-driver json-file --gpus all --shm-size 40g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --revision gptq-4bit-128g-actorder_True --quantize gptq
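For reference, once warmup succeeds this is roughly how the endpoint would be exercised (a standard TGI /generate request; the prompt and parameters below are just placeholders, and the host/port match the -p 8080:80 mapping above):

# placeholder sanity-check request against the running container
curl http://127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'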

Expected behavior

The model should load. Instead it gives this error:

2023-07-14T01:45:00.352230Z  INFO text_generation_launcher: Starting Webserver
2023-07-14T01:45:01.236319Z  INFO text_generation_router: router/src/main.rs:346: Serving revision eee3325448b6991efc148a34d42b737580439b8d of model TheBloke/WizardLM-33B-V1.0-Uncensored-GPTQ
2023-07-14T01:45:01.248137Z  INFO text_generation_router: router/src/main.rs:212: Warming up model
2023-07-14T01:45:02.429524Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-07-14T01:45:02.457462Z ERROR text_generation_launcher: Webserver Crashed
2023-07-14T01:45:02.457525Z  INFO text_generation_launcher: Shutting down shards
Error: WebserverFailed
2023-07-14T01:45:02.614000Z  INFO text_generation_launcher: Shard 0 terminated

itsmnjn avatar Jul 14 '23 01:07 itsmnjn

Just to say "me too" on this.

I'm also using docker run on a Lambda Labs H100, and I get the identical error.

As discussed in the linked thread, Olivier suggested I try the container ghcr.io/huggingface/text-generation-inference:sha-44acf72 with -e LOG_LEVEL=info,text_generation_launcher=debug to get extra logging, but unfortunately no additional logs appeared.

My logs:

Model is: TheBloke/vicuna-13b-v1.3.0-GPTQ
Launching docker with command: docker run --name tgi -e LOG_LEVEL=info,text_generation_launcher=debug,text_generation_router=debug -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 -e REVISION=gptq-4bit-128g-actorder_True --rm -v /workspace/data:/data --shm-size=1gb --runtime=nvidia --gpus all -p 2222:22/tcp -p 8000:8000/tcp ghcr.io/huggingface/text-generation-inference:sha-44acf72 --model-id 'TheBloke/vicuna-13b-v1.3.0-GPTQ' --port 8000 --hostname 0.0.0.0 --revision gptq-4bit-128g-actorder_True --quantize gptq
2023-07-17T20:33:00.664497Z  INFO text_generation_launcher: Args { model_id: "TheBloke/vicuna-13b-v1.3.0-GPTQ", revision: Some("gptq-4bit-128g-actorder_True"), validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-17T20:33:00.664674Z  INFO download: text_generation_launcher: Starting download process.
2023-07-17T20:33:03.080136Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.

2023-07-17T20:33:03.570459Z  INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-17T20:33:03.571018Z  INFO shard-manager: text_generation_launcher: Starting shard 0 rank=0
2023-07-17T20:33:06.971252Z DEBUG text_generation_launcher:
2023-07-17T20:33:06.971330Z DEBUG text_generation_launcher: ===================================BUG REPORT===================================
2023-07-17T20:33:06.971344Z DEBUG text_generation_launcher: Welcome to bitsandbytes. For bug reports, please run
2023-07-17T20:33:06.971351Z DEBUG text_generation_launcher:
2023-07-17T20:33:06.971358Z DEBUG text_generation_launcher: python -m bitsandbytes
2023-07-17T20:33:06.971364Z DEBUG text_generation_launcher:
2023-07-17T20:33:06.971371Z DEBUG text_generation_launcher:  and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
2023-07-17T20:33:06.971378Z DEBUG text_generation_launcher: ================================================================================
2023-07-17T20:33:06.971384Z DEBUG text_generation_launcher: bin /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
2023-07-17T20:33:06.971390Z DEBUG text_generation_launcher: CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so.11.0
2023-07-17T20:33:06.971396Z DEBUG text_generation_launcher: CUDA SETUP: Highest compute capability among GPUs detected: 9.0
2023-07-17T20:33:06.971402Z DEBUG text_generation_launcher: CUDA SETUP: Detected CUDA version 118
2023-07-17T20:33:06.971408Z DEBUG text_generation_launcher: CUDA SETUP: Loading binary /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-07-17T20:33:06.971449Z  WARN text_generation_launcher: We're not using custom kernels.

2023-07-17T20:33:13.196018Z  INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0

2023-07-17T20:33:13.290926Z  INFO shard-manager: text_generation_launcher: Shard 0 ready in 9.718539821s rank=0
2023-07-17T20:33:13.387844Z  INFO text_generation_launcher: Starting Webserver
2023-07-17T20:33:14.404158Z  INFO text_generation_router: router/src/main.rs:346: Serving revision 738964d963bd0aa13be29cb3ded3590a803b3b7e of model TheBloke/vicuna-13b-v1.3.0-GPTQ
2023-07-17T20:33:14.414842Z  INFO text_generation_router: router/src/main.rs:212: Warming up model
2023-07-17T20:33:15.580663Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-07-17T20:33:15.693480Z ERROR text_generation_launcher: Webserver Crashed
2023-07-17T20:33:15.693524Z  INFO text_generation_launcher: Shutting down shards
2023-07-17T20:33:15.698591Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

python3.9: /opt/conda/conda-bld/torchtriton_1677881350057/work/lib/Dialect/TritonGPU/Transforms/Combine.cpp:870: int {anonymous}::{anonymous}::computeCapabilityToMMAVersion(int): Assertion `false && "computeCapability > 90 not supported"' failed. rank=0
2023-07-17T20:33:15.698669Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=0

TheBloke avatar Jul 17 '23 20:07 TheBloke

Oh wow, in posting the error log I just noticed the actual error :)

Here we go:

python3.9: /opt/conda/conda-bld/torchtriton_1677881350057/work/lib/Dialect/TritonGPU/Transforms/Combine.cpp:870: int {anonymous}::{anonymous}::computeCapabilityToMMAVersion(int): Assertion `false && "computeCapability > 90 not supported"' failed. rank=0
2023-07-17T20:33:15.698669Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=0

OK, now I know what to do: I need PyTorch 2.1.
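For anyone else hitting this on Hopper cards, a quick sanity check from inside the container is to print the torch version and the GPU's compute capability (just a sketch; exact output will vary):

# H100 reports compute capability (9, 0); the Triton bundled with torch 2.0.x
# doesn't handle it and dies on the assertion shown in the log above.
python -c "import torch; print(torch.__version__, torch.cuda.get_device_capability(0))"
# On recent drivers, nvidia-smi can report the same thing:
nvidia-smi --query-gpu=name,compute_cap --format=csv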

TheBloke avatar Jul 17 '23 20:07 TheBloke

Nice, thank you! I will update the version of torch in the image as soon as 2.1 is out. Cheers!

OlivierDehaene avatar Jul 18 '23 06:07 OlivierDehaene

Since 2.1 isn't going to be out for another two months, I assume the easiest workaround for now is to completely rebuild the Docker image with the PyTorch nightlies? I struggled a bit with getting everything in sync: I tried just updating PyTorch inside the existing container, but I think that creates more problems than it solves, since all the other libraries were built against PyTorch 2.0.1.
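In case it helps anyone trying that route, this is roughly what I had in mind (a rough sketch only: it assumes you edit the repo's Dockerfile to install a nightly/pre-release torch for CUDA 11.8 instead of the pinned 2.0.1, and the CUDA-extension libraries such as flash-attention may still need rebuilding against it):

git clone https://github.com/huggingface/text-generation-inference.git
cd text-generation-inference
# In the Dockerfile, swap the pinned torch install for a nightly wheel, e.g.:
#   pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
docker build -t tgi-torch-nightly .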

stefanobranco avatar Aug 30 '23 14:08 stefanobranco

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Apr 29 '24 01:04 github-actions[bot]