text-generation-inference
How to solve "Model is overloaded" when sending 500 requests?
System Info
2023-06-12T09:06:08.879916Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: abd58ff82c37d5e4f131abdac3d298927a815604
Docker label: N/A
nvidia-smi:
Mon Jun 12 18:06:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.07 Driver Version: 515.65.07 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:10:00.0 Off | Off |
| N/A 35C P0 86W / 400W | 27436MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:16:00.0 Off | Off |
| N/A 33C P0 87W / 400W | 27580MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:49:00.0 Off | Off |
| N/A 34C P0 89W / 400W | 27580MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:4D:00.0 Off | Off |
| N/A 33C P0 92W / 400W | 27578MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:C5:00.0 Off | Off |
| N/A 34C P0 91W / 400W | 27578MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:CA:00.0 Off | Off |
| N/A 34C P0 90W / 400W | 27580MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:E3:00.0 Off | Off |
| N/A 33C P0 96W / 400W | 27582MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:E7:00.0 Off | Off |
| N/A 35C P0 90W / 400W | 27436MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
Information
- [X] Docker
- [X] The CLI directly
Tasks
- [ ] An officially supported command
- [X] My own modifications
Reproduction
- successfully launch the server (in my case Vicuna-7b on an A100 node with 8 GPUs)
- send 500 asynchronous requests with code such as:
responses = await asyncio.gather(*[acreate(u, payload) for u in urls])
- the first 50~120 requests are properly batched and responded to; for the remaining requests, the server instead responds with the following object:
{'error': 'Model is overloaded', 'error_type': 'overloaded'}
Here's the full sample code:

import asyncio
import json

import aiohttp


async def acreate(url: str, payload: dict):
    headers = {"Content-Type": "application/json"}
    async with aiohttp.ClientSession() as session:
        async with session.post(url, headers=headers, data=json.dumps(payload)) as response:
            resp = json.loads(await response.text())
            return (response.ok, resp)


async def main():
    url = "http://{YOUR_URL}/generate"
    params = {
        "best_of": 1,
        # "decoder_input_details": True,
        # "details": True,
        "do_sample": True,
        "max_new_tokens": 100,
        "repetition_penalty": 1.03,
        "return_full_text": False,
        # "seed": None,
        "stop": [],
        "temperature": 0.5,
        "top_k": 50,
        "top_p": 0.8,
        # "truncate": None,
        "typical_p": 0.95,
        # "watermark": False,
    }
    payload = {
        "inputs": "What is Deep Learning?",
        "parameters": params,
    }
    urls = [url] * 500

    responses = await asyncio.gather(*[acreate(u, payload) for u in urls])

    print()
    print("Responses:")
    for i, (ok, response) in enumerate(responses, start=1):
        if ok:
            print(f"Response {i}: {response['generated_text']}")
        else:
            print(f"Response {i}: {response['error']}")
    print()


asyncio.run(main())
Expected behavior
I expect either
- the requests to be queued properly and responded to in order
- or a more detailed error message than
error: Model is overloaded
Hi @jshin49,
This is working as intended; this is a backpressure mechanism. Basically, when your server is getting saturated you want to refuse new requests: otherwise you will still be serving very old requests, the client will most likely time out or cut the connection, and you will have wasted that work.
Erroring out early is better than serving old requests in order (in a realistic setting).
You can increase --max-concurrent-requests if you want (making it arbitrarily large will give you the behavior you expect, although I wouldn't recommend that in production settings).
As for the error message, there's really nothing more to add: the server is refusing your requests because it is under load and cannot handle them.
To complete what @Narsil just said, what you would usually do instead is add a rate limiter on the client side to avoid overloading the server (for example, limit the number of in-flight requests to 64 instead of sending all 500 requests at once); see the sketch below. If you don't do this, some of your requests might time out because they will be sitting in the queue for an extended period of time.
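A minimal sketch of that client-side limit, reusing the acreate helper from the reproduction above and using an asyncio.Semaphore to cap in-flight requests (the limit of 64 is just the example number from the comment, not a recommended value):

import asyncio

MAX_IN_FLIGHT = 64  # example cap on concurrent requests

async def bounded_acreate(semaphore: asyncio.Semaphore, url: str, payload: dict):
    # Only MAX_IN_FLIGHT coroutines can hold the semaphore at once; the rest
    # wait on the client side instead of piling up in the server queue.
    async with semaphore:
        return await acreate(url, payload)

async def run_all(url: str, payload: dict, n_requests: int = 500):
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(
        *[bounded_acreate(semaphore, url, payload) for _ in range(n_requests)]
    )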
I see, thanks for the responses! That totally makes sense. Is there a way to set up the timeout as well?
https://github.com/huggingface/text-generation-inference/blob/main/clients/python/text_generation/client.py#L285
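If you are calling the server with raw aiohttp as in the snippet above rather than the text_generation client linked here, one way to bound each request is aiohttp's ClientTimeout; a sketch, with the 60-second value purely illustrative:

import aiohttp
import json

async def acreate_with_timeout(url: str, payload: dict, timeout_s: float = 60.0):
    # Bound the total time spent on a single request; requests that would sit
    # in the server queue for too long fail fast on the client side instead.
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    headers = {"Content-Type": "application/json"}
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(url, headers=headers, data=json.dumps(payload)) as response:
            return (response.ok, json.loads(await response.text()))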
Do the number and the quality of the GPUs used influence the value of --max-concurrent-requests?
If I'm using 8 A100 GPUs, can I have a bigger --max-concurrent-requests than if I'm using only 2?
Let's imagine I want to deploy for 10,000 users; which hardware would be necessary?
If I'm using 8 A100 GPUs, can I have a bigger --max-concurrent-requests than if I'm using only 2?
Yes, if you shard across the 8 GPUs with --num-shard 8.
Let's imagine I want to deploy for 10,000 users; which hardware would be necessary?
This is impossible to answer without a lot more details.
For a full answer looking in detail at your specific needs, Hugging Face has an Expert Acceleration Program (https://huggingface.co/support) which might suit you.
Now I can give a gist:
- The larger the model, the lower the throughput and the higher the latency you're going to get.
- Models have 2 regimes: memory bound or compute bound. With 8 A100s you're likely to be memory bound (and this affects everything that follows).
- Once you're memory bound, you can "batch for free": stacking requests on the GPU costs almost nothing (you will run 2 requests in roughly the same amount of time as 1).
- Since batching is free, you want to set all the batching parameters to the maximum that still stays below the memory required to run the model. The best parameter to control that is --max-batch-total-tokens, which is the sum of all tokens across the batch (e.g. 200 means either 100 tokens spread over 2 requests or 1 request of 200 tokens).
And then it's all about looking at metrics, checking what users are actually doing on your system, and tuning things for your context (are requests long? do they generate many tokens? etc.).
This is the best I can do without having specifics.
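To make the memory-bound point concrete, here is a rough benchmarking sketch (reusing the acreate helper and payload from the reproduction above; not an official benchmark): if the wall-clock time for a small concurrent batch is close to the single-request time, you are likely memory bound and have headroom to raise the batching limits.

import asyncio
import time

async def time_batch(url: str, payload: dict, n: int) -> float:
    # Fire n identical requests concurrently and measure total wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*[acreate(url, payload) for _ in range(n)])
    return time.perf_counter() - start

async def compare_regimes(url: str, payload: dict):
    t1 = await time_batch(url, payload, 1)
    t8 = await time_batch(url, payload, 8)
    # If t8 is close to t1, batching is nearly "free" (memory-bound regime).
    print(f"1 request: {t1:.2f}s, 8 concurrent requests: {t8:.2f}s")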
Is there a good way to determine what my asyncio limiter's request rate should be to avoid overloading the model? I am using just one GPU to host the model.
Also, how can this be increased, for example by hosting on multiple GPUs? I am not super familiar with multi-GPU or sharding. Any pointers to understand this better, or tips on making inference faster / more parallel, would help. Thanks!
This should help you get started: https://www.youtube.com/watch?v=jlMAX2Oaht0