Contradictory suggestions: Not enough memory. Please try to increase --mem-fraction-static
Q: Should I increase or decrease --mem-fraction-static? (and what is the minimum and maximum value allowed?)
Looking at the source code (python/sglang/srt/managers/router/model_runner.py), I believe that increasing the value would alleviate the memory requirements, but I might be interpreting it wrong. I just wanted to point out that there is a mismatch between the advice given in the documentation and the advice given in the actual code.
Description of the problem:
I am trying to launch Mistral-7B-Instruct-v0.2 (using sglang==0.1.13):
python -m sglang.launch_server --model-path /llm_path/hf_model_mistral_7B_Instruct_v0_2 --port 30000
but I run into memory issues. At the end of the traceback it is suggested to increase --mem-fraction-static.
However, in the documentation (https://github.com/sgl-project/sglang) the opposite advice is given:
If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of --mem-fraction-static. The default value is 0.9
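For reference, the flag is passed on the command line like this (0.8 is just an example value, not a recommendation):

python -m sglang.launch_server --model-path /llm_path/hf_model_mistral_7B_Instruct_v0_2 --port 30000 --mem-fraction-static 0.8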
Keep up the good work :)
/sneglen
Here is the error:
Process Process-1:
router init state: Traceback (most recent call last):
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 619, in __init__
    self.model_server.exposed_init_model(0, server_args, port_args)
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 70, in exposed_init_model
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 272, in __init__
    self.init_memory_pool(total_gpu_memory)
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 331, in init_memory_pool
    raise RuntimeError(
RuntimeError: Not enought memory. Please try to increase --mem-fraction-static.
There are three types of memory in SGLang:
- memory for model weights.
- memory for KV cache.
- temporary memory for intermediate computing results.
The answer to your question is: it cannot be too large or too small, because we need enough memory to load the model weights and we also need spare memory for intermediate results.
Suppose your machine has 80 GB of GPU memory and the model weights take 60 GB. Then if you set --mem-fraction-static to 0.9, the memory for the KV cache is 80 GB * 0.9 - 60 GB = 12 GB, and the memory for intermediate results is 80 GB * (1.0 - 0.9) = 8 GB.
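To make the arithmetic concrete, here is a minimal sketch of the split implied by the formulas above (the function name is mine and the second example's numbers are hypothetical; this is not sglang code):

def memory_split(total_gpu_gb, weights_gb, mem_fraction_static):
    # The static pool (weights + KV cache) gets total * fraction;
    # intermediate results get the remaining total * (1 - fraction).
    kv_cache_gb = total_gpu_gb * mem_fraction_static - weights_gb
    intermediate_gb = total_gpu_gb * (1.0 - mem_fraction_static)
    return kv_cache_gb, intermediate_gb

# The 80 GB / 60 GB example from above:
kv, tmp = memory_split(80, 60, 0.9)
print(f"KV cache: {kv:.0f} GB, intermediate: {tmp:.0f} GB")  # 12 GB, 8 GB

# If kv_cache_gb comes out negative (or below some minimum), the static
# pool cannot even hold the weights plus a usable KV cache: that is the
# startup error asking you to *increase* --mem-fraction-static. If it is
# positive but you hit OOM during serving, the intermediate pool is too
# small: *decrease* --mem-fraction-static instead.
kv, tmp = memory_split(24, 22, 0.88)  # hypothetical small-GPU case
print(f"KV cache: {kv:.2f} GB")  # negative -> raise the fraction

Seen this way, the two pieces of advice are not contradictory: they address failures in two different memory pools.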
Thank you for clarifying with the formulas. I understand the issue better now.
However, I am still a bit puzzled: to me the guidance seems conflicting, because the right direction depends on the specific circumstances under which the error occurs. The error I encountered was:
RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.
so I should increase --mem-fraction-static.
On the other hand, according to the advice in the sglang documentation:
If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of --mem-fraction-static
You mention that --mem-fraction-static cannot be too large or too small. Would it be possible to catch the memory error and print the current memory usage (and the requirements, if possible, although they depend on the query submitted to the LLM?) for "model weights", "KV cache", and "temporary memory for intermediate results"? Then, using the formulas you wrote, the user could make an informed guess about whether to increase or decrease --mem-fraction-static.
Currently I don't know whether I should increase or decrease --mem-fraction-static; it's just a guess.
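Something like the following sketch would implement that suggestion. The function and variable names are hypothetical, and this is not sglang's actual init_memory_pool; it only illustrates the kind of report that would remove the guesswork:

def check_memory_pool(total_gb, weights_gb, mem_fraction_static,
                      min_kv_cache_gb=1.0):
    # Hypothetical error reporting: compute the breakdown implied by
    # --mem-fraction-static and say which way to adjust it.
    kv_cache_gb = total_gb * mem_fraction_static - weights_gb
    intermediate_gb = total_gb * (1.0 - mem_fraction_static)
    report = (f"total={total_gb:.1f} GB, weights={weights_gb:.1f} GB, "
              f"kv_cache={kv_cache_gb:.1f} GB, "
              f"intermediate={intermediate_gb:.1f} GB")
    if kv_cache_gb < min_kv_cache_gb:
        raise RuntimeError(
            f"Not enough memory for the KV cache ({report}). "
            f"Please increase --mem-fraction-static.")
    # An out-of-memory error later, during serving, would point at the
    # intermediate pool instead and suggest *decreasing* the fraction.
    return kv_cache_gb, intermediate_gb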
I have the same question; it's very confusing. Did you find a solution?
I set --mem-fraction-static to 0.9, which seems to be a reasonable value (?), and ended up using an A100 (40 GB), which in my case is more than enough for inference.
If you are really stuck, you could consider examining the arguments that can be passed to sglang:
https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py
See the "Optimization/debug options" section; attention_reduce_in_fp32 could be relevant.
If you are in control of the LLM you train, you could consider saving it as an 8-bit or even 4-bit version.
Here are some tricks: https://huggingface.co/docs/transformers/v4.35.0/en/llm_tutorial_optimization
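For example, a sketch using Hugging Face transformers with bitsandbytes (the model path is the one from this thread; whether sglang 0.1.13 can serve the quantized checkpoint is a separate question):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 8-bit via bitsandbytes (requires a CUDA GPU and
# the accelerate + bitsandbytes packages installed).
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "/llm_path/hf_model_mistral_7B_Instruct_v0_2",
    quantization_config=quant_config,
    device_map="auto",
)

# Saving quantized weights requires reasonably recent
# transformers/bitsandbytes versions.
model.save_pretrained("/llm_path/hf_model_mistral_7B_Instruct_v0_2-8bit")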
This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.