text-generation-inference
How to solve "Model is overloaded" when sending 500 requests?
System Info
2023-06-12T09:06:08.879916Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: abd58ff82c37d5e4f131abdac3d298927a815604
Docker label: N/A
nvidia-smi:
Mon Jun 12 18:06:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.07 Driver Version: 515.65.07 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:10:00.0 Off | Off |
| N/A 35C P0 86W / 400W | 27436MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:16:00.0 Off | Off |
| N/A 33C P0 87W / 400W | 27580MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:49:00.0 Off | Off |
| N/A 34C P0 89W / 400W | 27580MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:4D:00.0 Off | Off |
| N/A 33C P0 92W / 400W | 27578MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:C5:00.0 Off | Off |
| N/A 34C P0 91W / 400W | 27578MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:CA:00.0 Off | Off |
| N/A 34C P0 90W / 400W | 27580MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:E3:00.0 Off | Off |
| N/A 33C P0 96W / 400W | 27582MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:E7:00.0 Off | Off |
| N/A 35C P0 90W / 400W | 27436MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
Information
- [X] Docker
- [X] The CLI directly
Tasks
- [ ] An officially supported command
- [X] My own modifications
Reproduction
- successfully launch the server (in my case Vicuna-7b on an A100 node with 8 GPUs)
- send 500 asynchronous requests with code such as:
responses = await asyncio.gather(*[acreate(u, payload) for u in urls])
- the first 50~120 requests are properly batched and responded to; for the remaining requests, the server instead responds with the following object:
{'error': 'Model is overloaded', 'error_type': 'overloaded'}
Here's the full sample code:

import asyncio
import json

import aiohttp


async def acreate(url: str, payload: dict):
    headers = {"Content-Type": "application/json"}
    async with aiohttp.ClientSession() as session:
        async with session.post(url, headers=headers, data=json.dumps(payload)) as response:
            resp = json.loads(await response.text())
            return (response.ok, resp)


async def main():
    url = "http://{YOUR_URL}/generate"
    params = {
        "best_of": 1,
        # "decoder_input_details": True,
        # "details": True,
        "do_sample": True,
        "max_new_tokens": 100,
        "repetition_penalty": 1.03,
        "return_full_text": False,
        # "seed": None,
        "stop": [],
        "temperature": 0.5,
        "top_k": 50,
        "top_p": 0.8,
        # "truncate": None,
        "typical_p": 0.95,
        # "watermark": False,
    }
    payload = {
        "inputs": "What is Deep Learning?",
        "parameters": params,
    }
    urls = [url] * 500

    responses = await asyncio.gather(*[acreate(u, payload) for u in urls])

    print()
    print("Responses:")
    for i, (ok, response) in enumerate(responses, start=1):
        if ok:
            print(f"Response {i}: {response['generated_text']}")
        else:
            print(f"Response {i}: {response['error']}")
    print()


asyncio.run(main())
Expected behavior
I expect either
- the requests to be queued properly and responded to in order
- or a more detailed error message than
error: Model is overloaded
Hi @jshin49,
This is working as intended; this is a backpressure mechanism. Basically, when your server is getting saturated you want to refuse new requests: otherwise you will still be serving very old requests, the client will most likely time out or cut the connection, and you will have wasted that work.
Erroring out early is better than serving old requests in order (in a realistic setting).
You can increase --max-concurrent-requests if you want (making it arbitrarily large will give you the behavior you expect, although I wouldn't recommend that in production settings).
As for the error message, there's really nothing more to add: the server is refusing your requests because it is under load and cannot handle them.
To complete what @Narsil just said, what you would usually do instead is add a rate limiter on the client side to avoid overloading the server (for example, limit the number of in-flight requests to 64 instead of sending all 500 requests at once); see the sketch below. If you don't do this, some of your requests might time out because they will be sitting in the queue for an extended period of time.
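A minimal sketch of that client-side limit, reusing the acreate helper from the reproduction above and using an asyncio.Semaphore to cap in-flight requests (the limit of 64 is just the example number from the comment, not a recommended value):

import asyncio

MAX_IN_FLIGHT = 64  # example cap on concurrent requests

async def bounded_acreate(semaphore: asyncio.Semaphore, url: str, payload: dict):
    # Only MAX_IN_FLIGHT coroutines can hold the semaphore at once; the rest
    # wait on the client side instead of piling up in the server queue.
    async with semaphore:
        return await acreate(url, payload)

async def run_all(url: str, payload: dict, n_requests: int = 500):
    semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)
    return await asyncio.gather(
        *[bounded_acreate(semaphore, url, payload) for _ in range(n_requests)]
    )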
I see, thanks for the responses! That totally makes sense. Is there a way to set up the timeout as well?
https://github.com/huggingface/text-generation-inference/blob/main/clients/python/text_generation/client.py#L285
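If you are calling the server with raw aiohttp as in the snippet above rather than the text_generation client linked here, one way to bound each request is aiohttp's ClientTimeout; a sketch, with the 60-second value purely illustrative:

import aiohttp
import json

async def acreate_with_timeout(url: str, payload: dict, timeout_s: float = 60.0):
    # Bound the total time spent on a single request; requests that would sit
    # in the server queue for too long fail fast on the client side instead.
    timeout = aiohttp.ClientTimeout(total=timeout_s)
    headers = {"Content-Type": "application/json"}
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.post(url, headers=headers, data=json.dumps(payload)) as response:
            return (response.ok, json.loads(await response.text()))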
Do the number and the quality of the GPUs used influence the value of --max-concurrent-requests?
If I'm using 8 A100 GPUs, can I have a bigger --max-concurrent-requests than if I'm using only 2?
Let's imagine I want to deploy for 10,000 users; which hardware would be necessary?
If I'm using 8 A100 GPUs, can I have a bigger --max-concurrent-requests than if I'm using only 2?
Yes, if you shard across the 8 GPUs with --num-shard 8.
Let's imagine I want to deploy for 10,000 users; which hardware would be necessary?
This is impossible to answer without a lot more details.
For a full answer looking in detail at your specific needs, Hugging Face has an Expert Acceleration Program (https://huggingface.co/support) which might suit you.
Now I can give a gist:
- The larger the model, the lower the throughput and the higher the latency you're going to get.
- Models have 2 regimes: memory bound or compute bound. With 8 A100s you're likely to be memory bound (and this affects everything that follows).
- Once you're memory bound, you can "batch for free": stacking requests on the GPU costs almost nothing (you will run 2 requests in roughly the same amount of time as 1).
- Since batching is free, you want to set all the batching parameters to the maximum that still stays below the memory required to run the model. The best parameter to control that is --max-batch-total-tokens, which is the sum of all tokens across the batch (e.g. 200 means either 100 tokens spread over 2 requests or 1 request of 200 tokens).
And then it's all about looking at metrics, checking what users are actually doing on your system, and tuning things for your context (are requests long? do they generate many tokens? etc.).
This is the best I can do without having specifics.
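To make the memory-bound point concrete, here is a rough benchmarking sketch (reusing the acreate helper and payload from the reproduction above; not an official benchmark): if the wall-clock time for a small concurrent batch is close to the single-request time, you are likely memory bound and have headroom to raise the batching limits.

import asyncio
import time

async def time_batch(url: str, payload: dict, n: int) -> float:
    # Fire n identical requests concurrently and measure total wall-clock time.
    start = time.perf_counter()
    await asyncio.gather(*[acreate(url, payload) for _ in range(n)])
    return time.perf_counter() - start

async def compare_regimes(url: str, payload: dict):
    t1 = await time_batch(url, payload, 1)
    t8 = await time_batch(url, payload, 8)
    # If t8 is close to t1, batching is nearly "free" (memory-bound regime).
    print(f"1 request: {t1:.2f}s, 8 concurrent requests: {t8:.2f}s")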
Is there a good way to determine what my asyncio limiter's request rate should be to avoid overloading the model? I am using just one GPU to host the model.
Also, how can this be increased, for example by hosting on multiple GPUs? I am not super familiar with multi-GPU or sharding. Any pointers to understand this better, or tips on making inference faster / more parallel, would help. Thanks!
This should help you get started: https://www.youtube.com/watch?v=jlMAX2Oaht0