[Feature]: Automatic max-model-len or max-num-seqs

Open markouustalu opened this issue 11 months ago • 3 comments

🚀 The feature, motivation and pitch

Feature Request: Automatically Adjust --max-model-len and --max-num-seqs Based on GPU Memory, Cache Size, and Other Parameters

Problem to Solve: Currently, maximizing GPU memory usage in Aphrodite-engine requires trial and error to determine an appropriate balance between model length (--max-model-len) and the number of sequences (--max-num-seqs). This process involves multiple launches of the engine to assess cache availability after model loading, as well as determining how many sequences can be supported for a reasonable model length (e.g., 4096).

When starting the Aphrodite-engine, the log provides helpful information:

...
INFO:     # GPU blocks: 3385, # CPU blocks: 0
INFO:     Minimum concurrency: 3.31x
INFO:     Maximum sequence length allowed in the cache: 54160
...

However, if the model length is set slightly higher than the cache allows, the engine adjusts it automatically:

ERROR:    The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (12176). Try incr...

This behavior suggests the engine can determine cache capacity and adjust settings dynamically. Yet, if no model length is specified, the engine defaults to the value in config.json (e.g., 131072), which may exceed the available memory and result in a CUDA OOM error.

Proposed Solution:

  • Upon loading the model, the engine should automatically limit --max-model-len to the highest value supported by the available GPU memory and cache size, factoring in parameters such as --gpu-memory-utilization (-gmu) and the total GPU memory size.

  • If --max-num-seqs is specified, the engine could divide the available cache proportionally to maximize GPU utilization while maintaining safe operation.

  • Alternatively, if --max-model-len is specified, the engine should calculate the maximum number of sequences (--max-num-seqs) that can safely fit in the cache. This would eliminate the need for manual trial and error, making the engine more user-friendly and efficient; a rough sketch of the arithmetic follows this list.
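
For illustration, a minimal sketch of the arithmetic this proposal implies, assuming a KV cache block holds 16 tokens and reusing the 3385 GPU blocks from the log above (the helper names are made up for the sketch, not engine APIs):

# Hypothetical sketch of the proposed auto-sizing, not actual engine code.
# Assumes the profiler has already reported the KV cache capacity in blocks.

BLOCK_SIZE = 16  # tokens per KV cache block (assumed)

def max_model_len_for(num_gpu_blocks: int, max_num_seqs: int) -> int:
    """Largest per-sequence length such that max_num_seqs full sequences fit in the cache."""
    return (num_gpu_blocks * BLOCK_SIZE) // max_num_seqs

def max_num_seqs_for(num_gpu_blocks: int, max_model_len: int) -> int:
    """How many full-length sequences fit if max_model_len is fixed."""
    return (num_gpu_blocks * BLOCK_SIZE) // max_model_len

# With the 3385 GPU blocks from the log above:
print(max_model_len_for(3385, 1))    # 54160 -> matches "Maximum sequence length allowed"
print(max_model_len_for(3385, 4))    # 13540 -> what the engine could pick for 4 sequences
print(max_num_seqs_for(3385, 4096))  # 13    -> sequences that fit at a 4096 model length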

Additional Context: The ability to automatically balance these parameters seems feasible given the log output and current error-handling mechanisms. However, if there are technical constraints or complexities preventing this, clarification on the challenges involved would be helpful.

Alternatives

No response

Additional context

No response

markouustalu avatar Jan 05 '25 22:01 markouustalu

I'm not sure I understand the issue. If you need the engine to limit the max_model_len to the amount your GPU can fit, then we already handle that, as you showed yourself in the logs.

The "Maximum sequence length allowed in the cache" value reported in the logs is somewhat misleading, as it's a theoretical maximum based on the number of GPU blocks that could be allocated after the initial model profiling. This number isn't reliable either, and can vary significantly depending on your initial max_model_len setting because:

  • The profiling run uses max_model_len to determine how much memory to allocate during initialization
  • Higher max_model_len values lead to higher peak memory usage during profiling
  • Higher peak memory means fewer blocks can be allocated for the KV cache
  • Fewer blocks means a smaller actual maximum sequence length

This is why you might see different "Maximum sequence length allowed" values when starting the engine with different max_model_len settings, even on the same GPU with the same model.
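
To make this concrete, here is a toy calculation; every constant is invented for illustration and is not the engine's real accounting, but it shows how a larger max_model_len eats into the memory left for the KV cache during profiling and therefore shrinks the reported maximum:

# Toy model of the profiling behaviour described above; all constants are
# made up for illustration, not the engine's real bookkeeping.

BLOCK_SIZE = 16                   # tokens per KV cache block (assumed)
KV_BYTES_PER_TOKEN = 64 * 1024    # assumed KV cache cost per token
PEAK_BYTES_PER_TOKEN = 96 * 1024  # assumed profiling peak per token of max_model_len

def reported_max_seq_len(free_bytes_after_weights: int, max_model_len: int) -> int:
    # The profiling run reserves activation memory proportional to max_model_len,
    # and only what is left over becomes KV cache blocks.
    profiling_peak = PEAK_BYTES_PER_TOKEN * max_model_len
    kv_cache_bytes = free_bytes_after_weights - profiling_peak
    num_blocks = kv_cache_bytes // (KV_BYTES_PER_TOKEN * BLOCK_SIZE)
    return num_blocks * BLOCK_SIZE

free = 6 * 1024**3  # say ~6 GiB left after loading weights on a 12 GiB card
for mml in (1024, 8192, 32768):
    print(mml, reported_max_seq_len(free, mml))  # 96768, 86016, 49152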

As for automatically determining the optimal balance between max_model_len and max_num_seqs, I don't think that's very feasible. The memory requirements depend on the specific usage patterns (e.g., many short sequences vs. few long sequences), different models have different memory profiles for their attention mechanisms, and the relationship between these parameters and memory usage isn't strictly linear.

As an aside, you can try --single-user-mode, which limits max_num_seqs to 1, and allocates only enough memory for a single sequence. Combining this with --enable-chunked-prefill seems to use less memory than exllamav2 with a similarly-sized quant.

AlpinDale avatar Jan 05 '25 23:01 AlpinDale

I'm not sure I understand the issue. If you need the engine to limit the max_model_len to the amount your GPU can fit, then we already handle that, as you showed yourself in the logs.

Well, yes and no. If I set the model len slightly higher than would actually fit, then it is reduced. If I start the engine without specifying a length, it is taken from config.json and the engine dies without reducing it:

:~$ CUDA_VISIBLE_DEVICES=0 ./venv/bin/aphrodite run hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4 --kv-cache-dtype fp8_e5m2 --enable-prefix-caching
INFO:     The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
...
INFO:     Context Length = 131072
...
INFO:     Total model weights memory usage: 5.36 GiB
INFO:     Profiling peak memory usage...
Process SpawnProcess-1:
Traceback (most recent call last):
...
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB. GPU 0 has a total capacity of 11.66 GiB...
...

...This is why you might see different "Maximum sequence length allowed" values when starting the engine with different max_model_len settings, even on the same GPU with the same model.

Thank you for this. It explains what I was seeing and could find no explanation for, and it also illustrates my need very well. Say I take a model and load it up. I have no idea how much model-len I can use. I must enter something, because entering nothing ends in failure, so I enter 1024. Upon launching I see that, wow, more is available: 64k. I raise it to 32k and set max-num-seqs to 2. On the next launch I am told that only 12k is available now and it has to be reduced. I need a minimum of 16k, however, so I close the engine and adjust max-num-seqs to 1 and model-len to 16k. This time the engine reports that 28k is available. So it is a sort of cat-and-mouse game.

As for automatically determining the optimal balance between max_model_len and max_num_seqs I don't think it's that feasible. The memory requirements depend on the specific usage patterns (e.g., many short sequences vs few long sequences), different models have different memory profiles for their attention mechanisms, and the relationship between these parameters and memory usage isn't strictly linear.

Agreed. But I do not need you to determine the balance automatically. I will supply the balance by providing one of the two parameters; the engine determines the other based on what is possible.

Use case: I need to test different models and different settings (KV cache types and different -q quants) for best throughput. Every parameter changes the available memory. My requirement would be, say, a model-len of 2048, with everything else dedicated to max-num-seqs. As you also said, how much cache is really available depends on how you launch the engine, so this ends up being many launches until a good-enough configuration is reached.

markouustalu avatar Jan 06 '25 00:01 markouustalu

As I thought, there is a lot going on internally in this wonderful software, and not everything in the docs is spelled out sufficiently.

After scouring the issues here and in vLLM, I came to understand why I am having the problem I described. I still do not understand the internals well (or at all), but as I now understand it, when the engine is launched with max-model-len set, max-num-batched-tokens is also set to that value. During profiling, this many tokens must fit into GPU memory as well, and that amount is subtracted from the memory available for the KV cache. This creates the funny effect of the available GPU blocks magically shrinking as max-model-len increases. However, when I set max-num-batched-tokens to a fixed value, like 512, the GPU block count no longer changes when I change max-model-len. Now, after a single engine launch, I can calculate exactly how long a single sequence can be, set it, and expect the engine to launch. No more series of launches inching closer to the maximum available model length. I cannot find where the documentation explains that these parameters are tied together.
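
For illustration, a sketch of this workaround in the same toy terms as the earlier example (constants are again invented, not the engine's real accounting): once the profiling peak is tied to a pinned max-num-batched-tokens instead of max-model-len, the block count from one launch is stable and directly gives a safe single-sequence length:

# Illustrative sketch of the workaround above, not actual engine code.
# With max-num-batched-tokens pinned (e.g. to 512), the profiling peak no
# longer depends on max-model-len, so the reported block count is stable.

BLOCK_SIZE = 16                   # tokens per KV cache block (assumed)
KV_BYTES_PER_BLOCK = 1024 * 1024  # assumed KV cache cost per block
PEAK_BYTES_PER_TOKEN = 96 * 1024  # assumed profiling peak per batched token

def gpu_blocks(free_bytes_after_weights: int, max_num_batched_tokens: int) -> int:
    profiling_peak = PEAK_BYTES_PER_TOKEN * max_num_batched_tokens
    return (free_bytes_after_weights - profiling_peak) // KV_BYTES_PER_BLOCK

free = 6 * 1024**3              # say ~6 GiB left after loading weights
blocks = gpu_blocks(free, 512)  # same result regardless of max-model-len
print(blocks * BLOCK_SIZE)      # 97536 -> longest single sequence, i.e. a safe --max-model-len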

I still feel that there should be an option to launch the engine so that it simply uses the maximum sequence length that fits without OOM.

markouustalu avatar Jan 26 '25 23:01 markouustalu