Support automatically calculating max_total_token_num
In ApiServerArgs.md, an algorithm is described for calculating the optimal `max_total_token_num` argument. This process can be automated, and this PR introduces that feature.
The `max_total_token_num` argument now defaults to `None`. If it is not set, the API server automatically calculates the optimal value from the total GPU memory and the model size. A ratio of 0.8 is also applied to ensure enough memory is reserved for inference.
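For reference, a minimal sketch of the calculation, assuming free GPU memory is read via `torch.cuda.mem_get_info()` and the parameter size is estimated from the weight files on disk (the helper name, file patterns, and `kv_cache_size_per_token` argument are illustrative, not the exact code in this PR):

```python
import glob
import os

import torch


def estimate_max_total_token_num(model_dir: str, kv_cache_size_per_token: int,
                                 ratio: float = 0.8) -> int:
    """Rough estimate of max_total_token_num from free GPU memory.

    kv_cache_size_per_token is the per-token KV cache footprint in bytes
    (model dependent, discussed further below).
    """
    # Free GPU memory in bytes, as reported by the CUDA driver.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()

    # Approximate model_parameter_size by the size of the weight files on disk
    # (assumes uncompressed .bin / .safetensors checkpoints).
    weight_files = (glob.glob(os.path.join(model_dir, "*.bin"))
                    + glob.glob(os.path.join(model_dir, "*.safetensors")))
    model_parameter_size = sum(os.path.getsize(f) for f in weight_files)

    # (total_free_gpu_memory - model_parameter_size) * 0.8 / kv_cache_size
    usable = (free_bytes - model_parameter_size) * ratio
    return int(usable // kv_cache_size_per_token)
```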
Docs have also been updated.
Thanks for your great PR! We are refactoring part of our code and will merge your PR as soon as the refactored version is ready. Also, I would like to add you as a friend on WeChat. (hao95111)
@singularity-s0 Hello, can this feature be modified to support all models? Different models may have different calculation methods (GQA models are different), so should the implementation of this feature be bound to each individual model instance?
Hi,
I'm not entirely sure how GQA or other implementations affect GPU memory usage; could you please elaborate?
Generally, according to the docs, the formula is `max_total_token_num = (total_free_gpu_memory - model_parameter_size) * 0.8 / kv_cache_size`.
- `total_free_gpu_memory` is read using the PyTorch CUDA API. This should be the ideal implementation.
- `model_parameter_size` is estimated from the size of the weight files on disk. This should mostly be accurate, unless some kind of compression is used that I'm unaware of.
- `kv_cache_size` should depend on the model. If `config.json` provides enough information to calculate this value for each model, then model-specific implementations are not required. However, I'm not sure if this is always the case (maybe GQA somehow affects this?); see the sketch after this list.
- Some implementations may require additional memory (maybe GQA?). Either `config.json` tells us enough, or we need model-specific implementations.
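To make the `kv_cache_size` point concrete, here is a rough sketch of the per-token KV cache footprint for a plain multi-head attention model, derived from `config.json` (the field names follow the common Hugging Face convention, and the fp16 default and the helper itself are illustrative assumptions, not code from this PR):

```python
import json
import os


def kv_cache_bytes_per_token(model_dir: str, dtype_bytes: int = 2) -> int:
    """Per-token KV cache size in bytes for a standard multi-head attention model.

    Each layer stores one K and one V vector of hidden_size elements per token;
    dtype_bytes=2 assumes fp16 storage.
    """
    with open(os.path.join(model_dir, "config.json")) as f:
        config = json.load(f)

    num_layers = config["num_hidden_layers"]
    hidden_size = config["hidden_size"]  # num_attention_heads * head_dim

    return 2 * num_layers * hidden_size * dtype_bytes
```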
@singularity-s0 `kv_cache_size` is different for models that use GQA. See "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints".
From my understanding of the paper mentioned above, GQA reduces `kv_cache_size` by a factor of `num_attention_heads / num_key_value_heads`. Both values are available from `config.json`, so `kv_cache_size` can always be calculated.
The new formula will be:
`max_total_token_num = (total_free_gpu_memory - model_parameter_size) * 0.8 / original_kv_cache_size * num_attention_heads / num_key_value_heads`
For models that do not use GQA, simply default `num_key_value_heads` to `num_attention_heads`. All current models would be supported this way.
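As an updated version of the earlier sketch, the per-token KV cache size could then be computed as follows (again assuming Hugging Face-style `config.json` fields and fp16 storage; the helper is illustrative, not the exact code in this PR):

```python
import json
import os


def kv_cache_bytes_per_token(model_dir: str, dtype_bytes: int = 2) -> int:
    """Per-token KV cache size in bytes, accounting for GQA when present."""
    with open(os.path.join(model_dir, "config.json")) as f:
        config = json.load(f)

    num_layers = config["num_hidden_layers"]
    num_attention_heads = config["num_attention_heads"]
    head_dim = config["hidden_size"] // num_attention_heads

    # GQA models keep only num_key_value_heads K/V heads per layer; models
    # without GQA omit the field, so default it to num_attention_heads.
    num_key_value_heads = config.get("num_key_value_heads", num_attention_heads)

    return 2 * num_layers * num_key_value_heads * head_dim * dtype_bytes
```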
Is my understanding correct?
@singularity-s0 Yes, you are right.
This PR has been updated with changes to how kv_cache_size is calculated. Please review.