lorax
Need some help: "You need to decrease --max-batch-prefill-tokens."
System Info
latest
Information
- [ ] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
4x A100 (40 GB) cannot run it successfully. This looks like a memory issue. What parameters should I adjust to run the 72B model successfully? On 2x A100 I set --max-batch-prefill-tokens and all four parameters to 1, and it still fails.
docker run --gpus '"device=0,1,2,3"' \
  --shm-size 1g \
  -p 8081:80 \
  -v /home/unionlab001/Model/qwen-72b:/data ghcr.io/predibase/lorax:latest \
  --model-id /data/Qwen1_5-72B-Chat \
  --trust-remote-code \
  --quantize bitsandbytes-nf4 \
  --max-batch-prefill-tokens 300 \
  --max-input-length 200 \
  --max-total-tokens 1024 \
  --num-shard 4
RuntimeError: Not enough memory to handle 300 prefill tokens. You need to decrease --max-batch-prefill-tokens
2024-04-04T14:14:39.297825Z ERROR warmup{max_input_length=200 max_prefill_tokens=300 max_total_tokens=1024}:warmup: lorax_client: router/client/src/lib.rs:34: Server error: Not enough memory to handle 300 prefill tokens. You need to decrease--max-batch-prefill-tokens
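For context, here is a rough back-of-the-envelope memory estimate for this setup. It is only a sketch: the model-config numbers used for the KV cache (80 layers, 64 KV heads, head_dim 128) are assumptions for illustration, not verified against the actual Qwen1.5-72B config, and it ignores quantization/dequantization overhead, activations, and CUDA allocator fragmentation, any of which could explain the warmup OOM.

```python
# Back-of-the-envelope memory estimate for serving a 72B model with NF4
# weights across 4 GPUs. All model-config numbers (layers, kv_heads,
# head_dim) are ASSUMPTIONS for illustration, not the verified Qwen1.5-72B
# config. Real usage adds dequantization buffers, activations, and
# allocator overhead on top of these figures.

def nf4_weight_gb(params: float) -> float:
    """4-bit (NF4) weights take roughly 0.5 bytes per parameter."""
    return params * 0.5 / 1e9

def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes,
    accumulated per cached token (fp16 values by default)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

weights = nf4_weight_gb(72e9)   # ~36 GB total, ~9 GB per shard on 4 GPUs
kv = kv_cache_gb(tokens=1024, layers=80, kv_heads=64, head_dim=128)
print(f"weights ≈ {weights:.0f} GB total, "
      f"≈ {weights / 4:.1f} GB per GPU on 4 shards")
print(f"KV cache for 1024 tokens ≈ {kv:.1f} GB total")
```

Under these assumptions the quantized weights (~9 GB per shard) and a 1024-token KV cache (~2.7 GB) fit comfortably in 4x40 GB, which suggests the warmup failure comes from overhead not captured here rather than from the steady-state weight/KV footprint alone.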
Expected behavior
none