lorax
Need some help: "You need to decrease --max-batch-prefill-tokens."
System Info
latest
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
4x A100 (40 GB) cannot run it successfully. This looks like a memory issue. What parameters should I adjust to run the 72B model successfully? On 2x A100 I set --max-batch-prefill-tokens and all four parameters to 1, and it still fails.
docker run --gpus '"device=0,1,2,3"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/unionlab001/Model/qwen-72b:/data ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-72B-Chat \
  --trust-remote-code \
  --quantize bitsandbytes-nf4 \
  --max-batch-prefill-tokens 300 \
  --max-input-length 200 \
  --max-total-tokens 1024 \
  --num-shard 4
RuntimeError: Not enough memory to handle 300 prefill tokens. You need to decrease --max-batch-prefill-tokens
2024-04-04T14:14:39.297825Z ERROR warmup{max_input_length=200 max_prefill_tokens=300 max_total_tokens=1024}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: Not enough memory to handle 300 prefill tokens. You need to decrease--max-batch-prefill-tokens
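For context, here is a rough back-of-the-envelope memory estimate for this setup. It is only a sketch: the model-config numbers used for the KV cache (80 layers, 64 KV heads, head_dim 128) are assumptions for illustration, not verified against the actual Qwen1.5-72B config, and it ignores quantization/dequantization overhead, activations, and CUDA allocator fragmentation, any of which could explain the warmup OOM.

```python
# Back-of-the-envelope memory estimate for serving a 72B model with NF4
# weights across 4 GPUs. All model-config numbers (layers, kv_heads,
# head_dim) are ASSUMPTIONS for illustration, not the verified Qwen1.5-72B
# config. Real usage adds dequantization buffers, activations, and
# allocator overhead on top of these figures.

def nf4_weight_gb(params: float) -> float:
    """4-bit (NF4) weights take roughly 0.5 bytes per parameter."""
    return params * 0.5 / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes,
    accumulated per cached token (fp16 values by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

weights = nf4_weight_gb(72e9)   # ~36 GB total, ~9 GB per shard on 4 GPUs
kv = kv_cache_gb(tokens=1024, layers=80, kv_heads=64, head_dim=128)
print(f"weights ≈ {weights:.0f} GB total, "
      f"≈ {weights / 4:.1f} GB per GPU on 4 shards")
print(f"KV cache for 1024 tokens ≈ {kv:.1f} GB total")
```

Under these assumptions the quantized weights (~9 GB per shard) and a 1024-token KV cache (~2.7 GB) fit comfortably in 4x40 GB, which suggests the warmup failure comes from overhead not captured here rather than from the steady-state weight/KV footprint alone.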
Expected behavior
none