LoRAX stops responding after concurrent/non-concurrent requests
System Info
Hi, I'm using LoRAX to test, benchmark, and serve my LoRA adapters, but I keep hitting the same problem: after roughly 195 requests (concurrent or sequential), the OpenAI chat completion endpoints stop responding. This happens every time I run the Docker image with the following LoRAX launcher parameters:

- Model: meta-llama/Meta-Llama-3-8B-Instruct
- max_batch_prefill_tokens: 27000
- max_batch_total_tokens: 32000
- max_input_length: 27000
- max_total_tokens: 32000
- adapter_memory_fraction: 0.2

The /health and /info endpoints still respond, but the generation endpoints do not. Watching GPU utilization, the GPU is clearly active when a request is sent from Swagger, yet no response ever comes back. I use 8 different adapters to test LoRAX. The logger does not even appear to see the request, which suggests some kind of deadlock before the request is processed. The requests themselves are simple and at most 10,000 tokens.
NOTE: In some cases I could not even reach the Swagger UI, but this did not happen often.
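For reference, a minimal probe of the hung state looks like this. This is only a sketch: it assumes the server is exposed at http://localhost:8080 and that `my-adapter` stands in for one of my adapter ids; only the Python standard library is used.

```python
# Probe a hung LoRAX server (assumes it listens on localhost:8080).
# /health and /info return promptly, while a chat completion request times out.
import json
import urllib.error
import urllib.request

BASE = "http://localhost:8080"  # adjust to your port mapping

for path in ("/health", "/info"):
    with urllib.request.urlopen(BASE + path, timeout=10) as resp:
        print(path, resp.status)  # these still answer

payload = json.dumps({
    "model": "my-adapter",  # placeholder adapter id
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16,
}).encode()
req = urllib.request.Request(
    BASE + "/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        print("/v1/chat/completions", resp.status)
except (TimeoutError, urllib.error.URLError) as exc:
    print("/v1/chat/completions did not respond:", exc)  # the hang described above
```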
nvidia-smi:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:91:00.0 Off |                    0 |
| N/A  38C    P0            285W /  700W  |  60146MiB /  81559MiB  |     64%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
- Launch LoRAX with the parameters below, everything else at its default (I'm using a pre-downloaded model locally):
  - Model: meta-llama/Meta-Llama-3-8B-Instruct
  - max_batch_prefill_tokens: 27000
  - max_batch_total_tokens: 32000
  - max_input_length: 27000
  - max_total_tokens: 32000
  - adapter_memory_fraction: 0.2
- Continuously send requests to the chat completion endpoint. After a while you get no response and the requests time out (see the sketch after this list).
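A sketch of the kind of load I send, assuming the server is reachable at http://localhost:8080 and that adapter ids (here the placeholders `my-adapter-0` … `my-adapter-7`) are passed as the model; the prompts and concurrency settings are illustrative, not the exact benchmark I used:

```python
# Drive the chat completion endpoint with many requests in parallel.
# Assumes LoRAX is reachable at localhost:8080; the adapter ids are placeholders.
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "http://localhost:8080"
ADAPTERS = [f"my-adapter-{i}" for i in range(8)]  # placeholder ids for the 8 adapters

def chat(i: int) -> str:
    body = json.dumps({
        "model": ADAPTERS[i % len(ADAPTERS)],
        "messages": [{"role": "user", "content": f"Request {i}: say hello."}],
        "max_tokens": 32,
    }).encode()
    req = urllib.request.Request(
        BASE + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return f"{i}: HTTP {resp.status}"
    except (TimeoutError, urllib.error.URLError) as exc:
        return f"{i}: failed ({exc})"  # around ~195 requests these start timing out

# 8 concurrent workers, 300 requests total; the hang shows up well before the end.
with ThreadPoolExecutor(max_workers=8) as pool:
    for line in pool.map(chat, range(300)):
        print(line)
```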
Expected behavior
LoRAX should not hang when serving concurrent requests.