text-generation-inference

Can't start server with a small --max-total-tokens, but it works fine with a large setting

Open rooooc opened this issue 1 year ago • 4 comments

When I try to run CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --port 6634 --model-id /models/ --max-concurrent-requests 128 --max-input-length 64 --max-total-tokens 128 --max-batch-prefill-tokens 128 --cuda-memory-fraction 0.95, it says:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB. GPU has a total capacity of 44.53 GiB of which 1.94 MiB is free. Process 123210 has 44.52 GiB memory in use. Of the allocated memory 40.92 GiB is allocated by PyTorch, and 754.08 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management

But with larger max token settings, CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher --port 6634 --model-id /models/ --max-concurrent-requests 128 --max-input-length 1024 --max-total-tokens 2048 --max-batch-prefill-tokens 2048 --cuda-memory-fraction 0.95 works fine.

I don't understand why small max token values cause CUDA out of memory while large ones work fine. Can someone explain this?

rooooc avatar Jul 18 '24 07:07 rooooc

Hello @rooooc!

Your issue probably relates to not setting max-batch-total-tokens (https://huggingface.co/docs/text-generation-inference/en/basic_tutorials/launcher#maxbatchtotaltokens). Setting only max-total-tokens and max-batch-prefill-tokens does not control the maximum number of tokens that can be batched together, and that batch size is what determines the total GPU memory that can be used.
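
As a sketch, an explicit cap on your original command could look something like the following. The 4096 here is purely illustrative; the right value depends on how much GPU memory is left after the model weights are loaded, so treat it as a starting point rather than a recommendation:

CUDA_VISIBLE_DEVICES=0,1,2,3 text-generation-launcher \
    --port 6634 \
    --model-id /models/ \
    --max-concurrent-requests 128 \
    --max-input-length 64 \
    --max-total-tokens 128 \
    --max-batch-prefill-tokens 128 \
    --max-batch-total-tokens 4096 \
    --cuda-memory-fraction 0.95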

Hugoch avatar Jul 18 '24 09:07 Hugoch


OK, I got it. But why does it work when max-total-tokens is large, like 2048, and fail when it is 64? I am not setting max-batch-total-tokens in either case.

rooooc avatar Jul 18 '24 11:07 rooooc


I have set max-batch-total-tokens and it is still not working.

rooooc avatar Jul 18 '24 11:07 rooooc

@rooooc, you should be able to reduce max-batch-total-tokens until you reach a value that fits in your GPU memory. As stated in the doc:

Overall this number should be the largest possible amount that fits the remaining memory (after the model is loaded).

If you OOM, it should be reduced further.
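
Concretely, one simple way to converge on a working value is to keep the rest of your flags fixed and step the cap down on each OOM. The numbers below are only an illustration, not measured values:

text-generation-launcher <same flags as above> --max-batch-total-tokens 8192   # if this OOMs...
text-generation-launcher <same flags as above> --max-batch-total-tokens 4096   # ...halve it and retry
text-generation-launcher <same flags as above> --max-batch-total-tokens 2048   # until the server starts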

Hugoch avatar Jul 18 '24 12:07 Hugoch

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Aug 18 '24 01:08 github-actions[bot]