Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`")
System Info
latest lorax
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
@tgadair Thanks a lot.
I'm sure the GPU is free (no other process is using its memory).
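For anyone else checking the same thing, a quick way to confirm that device 1 really has no memory in use before launching the container:

nvidia-smi --id=1 --query-gpu=memory.used,memory.total --format=csv

If memory.used is already non-zero before the container starts, warmup has that much less headroom to work with.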
The command:
docker run --gpus '"device=1"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/Model/qwen:/data \
  ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-14B-Chat \
  --trust-remote-code \
  --max-batch-prefill-tokens 1024
The error:
RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens
2024-03-20T07:57:45.050465Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1024 max_total_tokens=2048}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: Not enough memory to handle 1024 prefill tokens. You need to decrease--max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2024-03-20T07:57:45.117908Z ERROR lorax_launcher: Webserver Crashed
2024-03-20T07:57:45.117934Z INFO lorax_launcher: Shutting down shards
2024-03-20T07:57:45.123286Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
Then I set it to use 2 GPUs, but it used 70GB of memory:
docker run --gpus '"device=1,2"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/unionlab001/Model/qwen:/data \
  ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-14B-Chat \
  --trust-remote-code \
  --max-batch-prefill-tokens 1024 \
  --num-shard 2
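The high usage itself may be expected: like the text-generation-inference server it is based on, lorax pre-allocates most of the remaining GPU memory for the paged KV cache during warmup, so ~70GB reported across two cards right after startup does not necessarily mean the weights need that much. If the usage needs capping, there is (I believe, inherited from the upstream launcher; confirm with lorax-launcher --help inside the image you are running) a --cuda-memory-fraction option, roughly:

docker run --gpus '"device=1,2"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/unionlab001/Model/qwen:/data \
  ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-14B-Chat \
  --trust-remote-code \
  --max-batch-prefill-tokens 1024 \
  --num-shard 2 \
  --cuda-memory-fraction 0.8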
I had the same issue before (with Qwen1, #115). It may be the same as #120.
Expected behavior
none
+1, the same error happened for Qwen1.5-14B-Chat-AWQ (1× 4090) and Qwen1.5-70B-Chat-AWQ (4× 4090), even after decreasing --max-batch-prefill-tokens to 512.
docker run --gpus '"device=0,1,2,3"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/unionlab001/Model/qwen-72b:/data \
  ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-72B-Chat \
  --trust-remote-code \
  --quantize bitsandbytes-nf4 \
  --max-batch-prefill-tokens 300 \
  --max-input-length 200 \
  --max-total-tokens 1024 \
  --num-shard 4
4× A100 (40GB) still cannot run it successfully. What is happening?
RuntimeError: Not enough memory to handle 300 prefill tokens. You need to decrease --max-batch-prefill-tokens
2024-04-04T14:14:39.297825Z ERROR warmup{max_input_length=200 max_prefill_tokens=300 max_total_tokens=1024}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: Not enough memory to handle 300 prefill tokens. You need to decrease--max-batch-prefill-tokens
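When warmup fails even at 300 prefill tokens on 4× A100 40GB, it can help to bisect: shrink every sequence-length knob to near-minimal values and check whether the shards at least finish loading and pass warmup. The sketch below just reuses the flags from the command above with deliberately tiny, arbitrary values (not a recommended configuration):

docker run --gpus '"device=0,1,2,3"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/unionlab001/Model/qwen-72b:/data \
  ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-72B-Chat \
  --trust-remote-code \
  --quantize bitsandbytes-nf4 \
  --max-input-length 64 \
  --max-total-tokens 128 \
  --max-batch-prefill-tokens 64 \
  --num-shard 4

If even this fails, the problem is likely weight loading or quantization overhead rather than the prefill settings themselves.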