
Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease `--max-batch-prefill-tokens`"))

KrisWongz opened this issue 11 months ago • 2 comments

System Info

latest lorax

Information

  • [ ] Docker
  • [ ] The CLI directly

Tasks

  • [ ] An officially supported command
  • [ ] My own modifications

Reproduction

@tgadair Thanks a lot. I'm sure the GPU is empty. The command:

docker run --gpus '"device=1"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/Model/qwen:/data ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-14B-Chat \
  --trust-remote-code \
  --max-batch-prefill-tokens 1024

The error:

RuntimeError: Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens
2024-03-20T07:57:45.050465Z ERROR warmup{max_input_length=1024 max_prefill_tokens=1024 max_total_tokens=2048}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens
Error: Warmup(Generation("Not enough memory to handle 1024 prefill tokens. You need to decrease --max-batch-prefill-tokens"))
2024-03-20T07:57:45.117908Z ERROR lorax_launcher: Webserver Crashed
2024-03-20T07:57:45.117934Z INFO lorax_launcher: Shutting down shards
2024-03-20T07:57:45.123286Z INFO shard-manager: lorax_launcher: Shard terminated rank=0
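For context, a rough back-of-envelope estimate of the weight memory alone (a sketch; the ~14e9 parameter count and fp16 width are assumptions, and it ignores the KV cache, activations, and CUDA overhead that the warmup step also has to fit):

```python
# Rough estimate of GPU memory needed just for the model weights.
# Assumptions (not from the logs): ~14e9 parameters, fp16 (2 bytes each).
def weight_memory_gib(n_params: float, bytes_per_param: float = 2) -> float:
    """Approximate weight memory in GiB."""
    return n_params * bytes_per_param / 2**30

qwen_14b = weight_memory_gib(14e9)  # fp16, no quantization
print(f"~{qwen_14b:.1f} GiB for fp16 weights alone")  # ~26.1 GiB
```

Since the weights alone are already close to a 32 GB card's capacity, even a small prefill batch can push the warmup allocation over the edge.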

Then I ran it with 2 GPUs instead, but it used 70 GB of memory:

docker run --gpus '"device=1,2"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/unionlab001/Model/qwen:/data ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-14B-Chat \
  --trust-remote-code \
  --max-batch-prefill-tokens 1024 \
  --num-shard 2


I had the same issue before with Qwen1 (#115). It may be the same as #120.

Expected behavior

none

KrisWongz avatar Mar 22 '24 09:03 KrisWongz

+1, the same error happens for Qwen1.5-14B-Chat-AWQ (1×4090) and Qwen1.5-70B-Chat-AWQ (4×4090), even after decreasing --max-batch-prefill-tokens to 512.

thincal avatar Mar 23 '24 01:03 thincal

docker run --gpus '"device=0,1,2,3"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/unionlab001/Model/qwen-72b:/data ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-72B-Chat \
  --trust-remote-code \
  --quantize bitsandbytes-nf4 \
  --max-batch-prefill-tokens 300 \
  --max-input-length 200 \
  --max-total-tokens 1024 \
  --num-shard 4

This fails even on 4× A100 (40 GB). What is happening?

RuntimeError: Not enough memory to handle 300 prefill tokens. You need to decrease --max-batch-prefill-tokens
2024-04-04T14:14:39.297825Z ERROR warmup{max_input_length=200 max_prefill_tokens=300 max_total_tokens=1024}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: Not enough memory to handle 300 prefill tokens. You need to decrease --max-batch-prefill-tokens
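What makes this run surprising is that the quantized weights should fit comfortably. A sketch of the per-GPU weight footprint (assumptions, not from the logs: 72e9 parameters, nf4 ≈ 0.5 bytes/param, even sharding, and no quantization metadata or dequantization buffers):

```python
# Sketch: per-GPU weight memory for a 4-bit-quantized 72B model sharded 4 ways.
# Assumptions (hypothetical figures): 72e9 params, nf4 ~= 0.5 bytes/param,
# perfectly even sharding, no quantization metadata overhead.
def per_gpu_weight_gib(n_params: float, bytes_per_param: float, num_shards: int) -> float:
    """Approximate weight memory per shard in GiB."""
    return n_params * bytes_per_param / 2**30 / num_shards

per_gpu = per_gpu_weight_gib(72e9, 0.5, 4)
print(f"~{per_gpu:.1f} GiB of weights per GPU")  # ~8.4 GiB
```

If roughly 8–9 GiB of a 40 GB card goes to weights, the warmup failure presumably comes from what is allocated on top of that (KV cache reservation, activations, dequantization workspace) rather than from the weights themselves.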

KrisWongz avatar Apr 04 '24 14:04 KrisWongz