text-generation-inference
Webserver crashing with GPTQ model `Server error: transport error Error: Warmup(Generation("transport error"))`
System Info
Lambda Labs H100 instance, Ubuntu, running the Docker image text-generation-inference:latest
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [X] My own modifications
Reproduction
Variables:
model=TheBloke/WizardLM-33B-V1.0-Uncensored-GPTQ
num_shard=1
volume=/home/ubuntu/data # share a volume with the Docker container to avoid downloading weights every run
Docker run command:
sudo docker run -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 -d --log-driver json-file --gpus all --shm-size 40g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --revision gptq-4bit-128g-actorder_True --quantize gptq
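For completeness, once the server comes up I would verify it with a request like the one below (a minimal smoke test, assuming the 8080:80 port mapping from the command above and the standard /generate endpoint):

```bash
# Minimal smoke test against the running server (host port 8080 mapped to container port 80)
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is deep learning?", "parameters": {"max_new_tokens": 20}}'
```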
Expected behavior
The model should load. Instead it gives this error:
2023-07-14T01:45:00.352230Z INFO text_generation_launcher: Starting Webserver
2023-07-14T01:45:01.236319Z INFO text_generation_router: router/src/main.rs:346: Serving revision eee3325448b6991efc148a34d42b737580439b8d of model TheBloke/WizardLM-33B-V1.0-Uncensored-GPTQ
2023-07-14T01:45:01.248137Z INFO text_generation_router: router/src/main.rs:212: Warming up model
2023-07-14T01:45:02.429524Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-07-14T01:45:02.457462Z ERROR text_generation_launcher: Webserver Crashed
2023-07-14T01:45:02.457525Z INFO text_generation_launcher: Shutting down shards
Error: WebserverFailed
2023-07-14T01:45:02.614000Z INFO text_generation_launcher: Shard 0 terminated
Just to say "me too" on this.
I'm also using docker run on a Lambda Labs H100 and get the identical error.
As discussed in the linked thread, Olivier suggested I try the container ghcr.io/huggingface/text-generation-inference:sha-44acf72
with -e LOG_LEVEL=info,text_generation_launcher=debug
to get extra logging, but unfortunately no additional logs appeared.
My logs:
Model is: TheBloke/vicuna-13b-v1.3.0-GPTQ
Launching docker with command: docker run --name tgi -e LOG_LEVEL=info,text_generation_launcher=debug,text_generation_router=debug -e GPTQ_BITS=4 -e GPTQ_GROUPSIZE=128 -e REVISION=gptq-4bit-128g-actorder_True --rm -v /workspace/data:/data --shm-size=1gb --runtime=nvidia --gpus all -p 2222:22/tcp -p 8000:8000/tcp ghcr.io/huggingface/text-generation-inference:sha-44acf72 --model-id 'TheBloke/vicuna-13b-v1.3.0-GPTQ' --port 8000 --hostname 0.0.0.0 --revision gptq-4bit-128g-actorder_True --quantize gptq
2023-07-17T20:33:00.664497Z INFO text_generation_launcher: Args { model_id: "TheBloke/vicuna-13b-v1.3.0-GPTQ", revision: Some("gptq-4bit-128g-actorder_True"), validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Gptq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: 16000, max_waiting_tokens: 20, hostname: "0.0.0.0", port: 8000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_domain: None, ngrok_username: None, ngrok_password: None, env: false }
2023-07-17T20:33:00.664674Z INFO download: text_generation_launcher: Starting download process.
2023-07-17T20:33:03.080136Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2023-07-17T20:33:03.570459Z INFO download: text_generation_launcher: Successfully downloaded weights.
2023-07-17T20:33:03.571018Z INFO shard-manager: text_generation_launcher: Starting shard 0 rank=0
2023-07-17T20:33:06.971252Z DEBUG text_generation_launcher:
2023-07-17T20:33:06.971330Z DEBUG text_generation_launcher: ===================================BUG REPORT===================================
2023-07-17T20:33:06.971344Z DEBUG text_generation_launcher: Welcome to bitsandbytes. For bug reports, please run
2023-07-17T20:33:06.971351Z DEBUG text_generation_launcher:
2023-07-17T20:33:06.971358Z DEBUG text_generation_launcher: python -m bitsandbytes
2023-07-17T20:33:06.971364Z DEBUG text_generation_launcher:
2023-07-17T20:33:06.971371Z DEBUG text_generation_launcher: and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
2023-07-17T20:33:06.971378Z DEBUG text_generation_launcher: ================================================================================
2023-07-17T20:33:06.971384Z DEBUG text_generation_launcher: bin /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
2023-07-17T20:33:06.971390Z DEBUG text_generation_launcher: CUDA SETUP: CUDA runtime path found: /opt/conda/lib/libcudart.so.11.0
2023-07-17T20:33:06.971396Z DEBUG text_generation_launcher: CUDA SETUP: Highest compute capability among GPUs detected: 9.0
2023-07-17T20:33:06.971402Z DEBUG text_generation_launcher: CUDA SETUP: Detected CUDA version 118
2023-07-17T20:33:06.971408Z DEBUG text_generation_launcher: CUDA SETUP: Loading binary /opt/conda/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
2023-07-17T20:33:06.971449Z WARN text_generation_launcher: We're not using custom kernels.
2023-07-17T20:33:13.196018Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2023-07-17T20:33:13.290926Z INFO shard-manager: text_generation_launcher: Shard 0 ready in 9.718539821s rank=0
2023-07-17T20:33:13.387844Z INFO text_generation_launcher: Starting Webserver
2023-07-17T20:33:14.404158Z INFO text_generation_router: router/src/main.rs:346: Serving revision 738964d963bd0aa13be29cb3ded3590a803b3b7e of model TheBloke/vicuna-13b-v1.3.0-GPTQ
2023-07-17T20:33:14.414842Z INFO text_generation_router: router/src/main.rs:212: Warming up model
2023-07-17T20:33:15.580663Z ERROR warmup{max_input_length=1024 max_prefill_tokens=4096 max_total_tokens=16000}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: transport error
Error: Warmup(Generation("transport error"))
2023-07-17T20:33:15.693480Z ERROR text_generation_launcher: Webserver Crashed
2023-07-17T20:33:15.693524Z INFO text_generation_launcher: Shutting down shards
2023-07-17T20:33:15.698591Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
python3.9: /opt/conda/conda-bld/torchtriton_1677881350057/work/lib/Dialect/TritonGPU/Transforms/Combine.cpp:870: int {anonymous}::{anonymous}::computeCapabilityToMMAVersion(int): Assertion `false && "computeCapability > 90 not supported"' failed. rank=0
2023-07-17T20:33:15.698669Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=0
Oh wow, in posting the error log I just noticed the actual error :)
Here we go:
python3.9: /opt/conda/conda-bld/torchtriton_1677881350057/work/lib/Dialect/TritonGPU/Transforms/Combine.cpp:870: int {anonymous}::{anonymous}::computeCapabilityToMMAVersion(int): Assertion `false && "computeCapability > 90 not supported"' failed. rank=0
2023-07-17T20:33:15.698669Z ERROR shard-manager: text_generation_launcher: Shard process was signaled to shutdown with signal 6 rank=0
OK, now I know what to do. I need PyTorch 2.1: the Triton build that ships with PyTorch 2.0.x doesn't support the H100's compute capability 9.0, which is what that assertion is tripping on.
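For anyone else hitting this on Hopper, a quick way to confirm the mismatch is to ask the image itself what it ships (a small sketch that overrides the launcher entrypoint; assumes the published image and a visible GPU):

```bash
# Print the bundled torch/triton versions and the GPU compute capability
# (an H100 reports (9, 0)); runs python instead of the launcher entrypoint
docker run --rm --gpus all --entrypoint python \
    ghcr.io/huggingface/text-generation-inference:latest \
    -c "import torch, triton; print(torch.__version__, triton.__version__, torch.cuda.get_device_capability(0))"
```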
Nice, thank you! I will update the torch version in the image as soon as 2.1 is out. Cheers!
Since 2.1 isn't going to be out for another two months, I assume the easiest workaround for now is to completely rebuild the Docker container against the PyTorch nightlies? I struggled a bit with getting everything synced: I tried just updating PyTorch inside the existing container, but that seems to create more problems than it solves, since all the other libraries were built against PyTorch 2.0.1.
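For illustration, that quick-and-dirty "bump torch in the existing image" route would look roughly like this (untested sketch, assuming the cu118 nightly wheel index; expect breakage in anything compiled against 2.0.1, which is why a full rebuild against the nightly is probably the saner option):

```bash
# Rough sketch: layer a PyTorch nightly (whose bundled Triton supports Hopper)
# on top of the published image. Extensions built against torch 2.0.1 may break.
cat > Dockerfile.torch-nightly <<'EOF'
FROM ghcr.io/huggingface/text-generation-inference:latest
RUN pip install --pre --upgrade torch --index-url https://download.pytorch.org/whl/nightly/cu118
EOF
docker build -f Dockerfile.torch-nightly -t tgi-torch-nightly .
```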