LoRAX stops responding after concurrent/non-concurrent requests
System Info
Hi, I'm using LoRAX to test, benchmark, and serve my LoRA adapters, but I keep hitting the same problem: after roughly 195 requests (concurrent or sequential), the OpenAI chat completion endpoints stop responding. This happens every time I run the Docker image with the following LoRAX launcher parameters:

- Model: meta-llama/Meta-Llama-3-8B-Instruct
- max_batch_prefill_tokens: 27000
- max_batch_total_tokens: 32000
- max_input_length: 27000
- max_total_tokens: 32000
- adapter_memory_fraction: 0.2

The /health and /info endpoints still respond, but the generation endpoints do not. Watching GPU utilization, the GPU is clearly active when a request is sent from Swagger, yet no response ever comes back. I use 8 different adapters to test LoRAX. The logger does not even appear to see the request, which suggests some kind of deadlock before the request is processed. The requests themselves are simple and at most 10,000 tokens.
NOTE: In some cases I could not even reach the Swagger UI, but this did not happen often.
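For reference, a minimal probe of the hung state looks like this. This is only a sketch: it assumes the server is exposed at http://localhost:8080 and that `my-adapter` stands in for one of my adapter ids; only the Python standard library is used.

```python
# Probe a hung LoRAX server (assumes it listens on localhost:8080).
# /health and /info return promptly, while a chat completion request times out.
import json
import urllib.error
import urllib.request

BASE = "http://localhost:8080"  # adjust to your port mapping

for path in ("/health", "/info"):
    with urllib.request.urlopen(BASE + path, timeout=10) as resp:
        print(path, resp.status)  # these still answer

payload = json.dumps({
    "model": "my-adapter",  # placeholder adapter id
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 16,
}).encode()
req = urllib.request.Request(
    BASE + "/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=60) as resp:
        print("/v1/chat/completions", resp.status)
except (TimeoutError, urllib.error.URLError) as exc:
    print("/v1/chat/completions did not respond:", exc)  # the hang described above
```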
nvidia-smi:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4      |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA H100 80GB HBM3          On  |   00000000:91:00.0 Off |                    0 |
| N/A  38C    P0            285W /  700W  |  60146MiB /  81559MiB  |     64%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [ ] My own modifications
Reproduction
- Launch LoRAX with the parameters below, everything else at its default (I'm using a pre-downloaded model locally):
  - Model: meta-llama/Meta-Llama-3-8B-Instruct
  - max_batch_prefill_tokens: 27000
  - max_batch_total_tokens: 32000
  - max_input_length: 27000
  - max_total_tokens: 32000
  - adapter_memory_fraction: 0.2
- Continuously send requests to the chat completion endpoint. After a while you get no response and the requests time out (see the sketch after this list).
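A sketch of the kind of load I send, assuming the server is reachable at http://localhost:8080 and that adapter ids (here the placeholders `my-adapter-0` … `my-adapter-7`) are passed as the model; the prompts and concurrency settings are illustrative, not the exact benchmark I used:

```python
# Drive the chat completion endpoint with many requests in parallel.
# Assumes LoRAX is reachable at localhost:8080; the adapter ids are placeholders.
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BASE = "http://localhost:8080"
ADAPTERS = [f"my-adapter-{i}" for i in range(8)]  # placeholder ids for the 8 adapters

def chat(i: int) -> str:
    body = json.dumps({
        "model": ADAPTERS[i % len(ADAPTERS)],
        "messages": [{"role": "user", "content": f"Request {i}: say hello."}],
        "max_tokens": 32,
    }).encode()
    req = urllib.request.Request(
        BASE + "/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=120) as resp:
            return f"{i}: HTTP {resp.status}"
    except (TimeoutError, urllib.error.URLError) as exc:
        return f"{i}: failed ({exc})"  # around ~195 requests these start timing out

# 8 concurrent workers, 300 requests total; the hang shows up well before the end.
with ThreadPoolExecutor(max_workers=8) as pool:
    for line in pool.map(chat, range(300)):
        print(line)
```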
Expected behavior
LoRAX should not hang when serving concurrent requests.