
Contradictory suggestions: Not enough memory. Please try to increase --mem-fraction-static

Open sneglen opened this issue 1 year ago • 4 comments

Q: Should I increase or decrease --mem-fraction-static? (And what are the minimum and maximum values allowed?)

Looking at the source code (python/sglang/srt/managers/router/model_runner.py), I believe increasing the value would alleviate the memory pressure, but I may be interpreting it wrong. I just wanted to point out that there is a mismatch between the advice given in the documentation and the advice given in the code itself.

Description of the problem:

I am trying to launch Mistral-7B-Instruct-v0.2 (using sglang==0.1.13):

python -m sglang.launch_server --model-path /llm_path/hf_model_mistral_7B_Instruct_v0_2 --port 30000

but I run into memory issues. At the end of the traceback it is suggested to increase --mem-fraction-static.

However, in the documentation (https://github.com/sgl-project/sglang) the opposite advice is given:

If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of --mem-fraction-static. The default value is 0.9
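
Following that advice literally would mean launching with a smaller value, for example:

python -m sglang.launch_server --model-path /llm_path/hf_model_mistral_7B_Instruct_v0_2 --port 30000 --mem-fraction-static 0.8

(0.8 here is just an arbitrary value below the 0.9 default.)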

Keep up the good work :)

/sneglen

Here is the error:

Process Process-1:
router init state: Traceback (most recent call last):
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/manager.py", line 68, in start_router_process
    model_client = ModelRpcClient(server_args, port_args)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 619, in __init__
    self.model_server.exposed_init_model(0, server_args, port_args)
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/model_rpc.py", line 70, in exposed_init_model
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 272, in __init__
    self.init_memory_pool(total_gpu_memory)
  File "/zhome/ac/8/105765/venv/env_MT/lib/python3.11/site-packages/sglang/srt/managers/router/model_runner.py", line 331, in init_memory_pool
    raise RuntimeError(
RuntimeError: Not enought memory. Please try to increase --mem-fraction-static.

sneglen avatar Mar 22 '24 12:03 sneglen

There are three types of memory in SGLang:

  1. memory for model weights.
  2. memory for KV cache.
  3. temporary memory for intermediate computing results.

The answer to your question is that it cannot be too large or too small: we need enough memory to load the model weights, and we also need spare memory for intermediate results.

Suppose your machine has 80 GB of GPU memory and the model weights take 60 GB. If you set --mem-fraction-static to 0.9, then the memory for the KV cache is 80 GB * 0.9 - 60 GB = 12 GB, and the memory for intermediate results is 80 GB * (1.0 - 0.9) = 8 GB.
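
As a back-of-the-envelope sketch of this arithmetic (the function below is just an illustration, not actual SGLang code):

def memory_budget(total_gb, weights_gb, mem_fraction_static):
    # --mem-fraction-static reserves a static pool for weights + KV cache;
    # whatever is outside that pool is left for intermediate results.
    static_pool_gb = total_gb * mem_fraction_static
    kv_cache_gb = static_pool_gb - weights_gb
    intermediate_gb = total_gb * (1.0 - mem_fraction_static)
    return kv_cache_gb, intermediate_gb

print(memory_budget(80, 60, 0.9))  # (12.0, 8.0)  -- the example above
print(memory_budget(80, 60, 0.8))  # (4.0, 16.0)  -- smaller KV cache, more temp memory

So increasing the fraction helps when the static pool cannot even hold the model weights (the startup error you saw), while decreasing it helps when serving runs out of temporary memory for intermediate results.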

hnyls2002 avatar Apr 07 '24 08:04 hnyls2002

Thank you for the clarification with the formulas; I now understand the issue better.

However, I am still a bit puzzled: the guidance seems conflicting because the right direction depends on the specific circumstances under which the error occurs. The error I encountered was:

RuntimeError: Not enough memory. Please try to increase --mem-fraction-static.

so I should increase --mem-fraction-static.

On the other hand, according to the advice in sglang:

If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of --mem-fraction-static

You mention that --mem-fraction-static can be neither too large nor too small. Would it be possible to catch the memory error and print the current memory usage of the model weights, the KV cache, and the temporary memory for intermediate results (and, if possible, the requirements, though those depend on the query submitted to the LLM)? Using the formulas you wrote, the user could then make an informed decision about whether to increase or decrease --mem-fraction-static. See the sketch below.
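
Purely as a hypothetical illustration (none of these names exist in SGLang; the structure is mine):

def init_memory_pool(total_gpu_gb, weights_gb, mem_fraction_static):
    # Hypothetical sketch of the diagnostic suggested above, not SGLang code.
    static_pool_gb = total_gpu_gb * mem_fraction_static
    kv_cache_gb = static_pool_gb - weights_gb
    if kv_cache_gb <= 0:
        raise RuntimeError(
            "Not enough memory for the KV cache pool.\n"
            f"  total GPU memory : {total_gpu_gb:6.1f} GB\n"
            f"  model weights    : {weights_gb:6.1f} GB\n"
            f"  static pool      : {static_pool_gb:6.1f} GB "
            f"(--mem-fraction-static={mem_fraction_static})\n"
            f"  KV cache would be: {kv_cache_gb:6.1f} GB\n"
            "Increase --mem-fraction-static so the static pool at least covers "
            "the weights; decrease it if you instead hit out-of-memory errors "
            "for intermediate results during serving."
        )
    return kv_cache_gb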

Currently I don't know whether I should increase or decrease --mem-fraction-static; it is just a guess.

sneglen avatar Apr 07 '24 10:04 sneglen

(quoting sneglen's comment above in full)

I have the same question and it's very confusing. Did you find a solution?

Iven2132 avatar May 15 '24 16:05 Iven2132

I set --mem-fraction-static to 0.9, which seems to be a reasonable value (?), and ended up using an A100 (40 GB), which in my case is more than enough for inference.

If you are really stuck, you could consider examining the arguments accepted by sglang:

https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/server_args.py

See "Optimization/debug options" attention_reduce_in_fp32 could be relevant.

If you are in control of the LLM you train, you could consider saving it as an 8-bit or even 4-bit version.

There are some tricks here: https://huggingface.co/docs/transformers/v4.35.0/en/llm_tutorial_optimization (for example, 4-bit loading, sketched below).
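
For illustration, the quantized-loading pattern from that tutorial looks roughly like this (standard transformers + bitsandbytes usage; whether SGLang can serve such a checkpoint directly is a separate question):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the weights in 4-bit via bitsandbytes, roughly quartering the
# weight memory compared to fp16; compute still runs in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "/llm_path/hf_model_mistral_7B_Instruct_v0_2",
    quantization_config=bnb_config,
    device_map="auto",
)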

sneglen avatar May 15 '24 17:05 sneglen

This issue has been automatically closed due to inactivity. Please feel free to reopen it if needed.

github-actions[bot] avatar Jul 25 '24 06:07 github-actions[bot]