Support automatically calculating max_total_token_num
In ApiServerArgs.md, an algorithm is described for calculating the optimal `max_total_token_num` argument. This process can be automated, and this PR introduces that feature.
The `max_total_token_num` argument now defaults to `None`. If it is not set, the API server automatically calculates the optimal value from the total GPU memory and the model size. A ratio of 0.8 is also applied to ensure enough memory is reserved for inference.
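For reference, a minimal sketch of the calculation, assuming free GPU memory is read via `torch.cuda.mem_get_info()` and the parameter size is estimated from the weight files on disk (the helper name, file patterns, and `kv_cache_size_per_token` argument are illustrative, not the exact code in this PR):

```python
import glob
import os

import torch


def estimate_max_total_token_num(model_dir: str, kv_cache_size_per_token: int,
                                 ratio: float = 0.8) -> int:
    """Rough estimate of max_total_token_num from free GPU memory.

    kv_cache_size_per_token is the per-token KV cache footprint in bytes
    (model dependent, discussed further below).
    """
    # Free GPU memory in bytes, as reported by the CUDA driver.
    free_bytes, _total_bytes = torch.cuda.mem_get_info()

    # Approximate model_parameter_size by the size of the weight files on disk
    # (assumes uncompressed .bin / .safetensors checkpoints).
    weight_files = (glob.glob(os.path.join(model_dir, "*.bin"))
                    + glob.glob(os.path.join(model_dir, "*.safetensors")))
    model_parameter_size = sum(os.path.getsize(f) for f in weight_files)

    # (total_free_gpu_memory - model_parameter_size) * 0.8 / kv_cache_size
    usable = (free_bytes - model_parameter_size) * ratio
    return int(usable // kv_cache_size_per_token)
```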
Docs have also been updated.
Thanks for your great PR! We are refactoring part of our code and will merge your PR as soon as the refactored version is ready. Also, I would like to add you as a friend on WeChat. (hao95111)
@singularity-s0 Hello, can this feature be modified to support all models? Different models may have different calculation methods (GQA models are different), so should the implementation of this feature be bound to each individual model instance?
Hi,
I'm not entirely sure how GQA or other implementations affect GPU memory usage; could you please elaborate?
Generally, according to the docs, the formula is `max_total_token_num = (total_free_gpu_memory - model_parameter_size) * 0.8 / kv_cache_size`.
- `total_free_gpu_memory` is read using the PyTorch CUDA API. This should be the ideal implementation.
- `model_parameter_size` is estimated from the size of the weight files on disk. This should mostly be accurate, unless some kind of compression is used that I'm unaware of.
- `kv_cache_size` should depend on the model. If `config.json` provides enough information to calculate this value for each model, then model-specific implementations are not required. However, I'm not sure if this is always the case (maybe GQA somehow affects this?); see the sketch after this list.
- Some implementations may require additional memory (maybe GQA?). Either `config.json` tells us enough, or we need model-specific implementations.
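To make the `kv_cache_size` point concrete, here is a rough sketch of the per-token KV cache footprint for a plain multi-head attention model, derived from `config.json` (the field names follow the common Hugging Face convention, and the fp16 default and the helper itself are illustrative assumptions, not code from this PR):

```python
import json
import os


def kv_cache_bytes_per_token(model_dir: str, dtype_bytes: int = 2) -> int:
    """Per-token KV cache size in bytes for a standard multi-head attention model.

    Each layer stores one K and one V vector of hidden_size elements per token;
    dtype_bytes=2 assumes fp16 storage.
    """
    with open(os.path.join(model_dir, "config.json")) as f:
        config = json.load(f)

    num_layers = config["num_hidden_layers"]
    hidden_size = config["hidden_size"]  # num_attention_heads * head_dim

    return 2 * num_layers * hidden_size * dtype_bytes
```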
@singularity-s0 `kv_cache_size` is different for models that use GQA. See "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints".
From my understanding of the paper mentioned above, GQA reduces `kv_cache_size` by a factor of `num_attention_heads / num_key_value_heads`. Both values are available from `config.json`, so `kv_cache_size` can always be calculated.
The new formula will be:
`max_total_token_num = (total_free_gpu_memory - model_parameter_size) * 0.8 / original_kv_cache_size * num_attention_heads / num_key_value_heads`
For models that do not use GQA, simply default `num_key_value_heads` to `num_attention_heads`. All current models would be supported this way.
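As an updated version of the earlier sketch, the per-token KV cache size could then be computed as follows (again assuming Hugging Face-style `config.json` fields and fp16 storage; the helper is illustrative, not the exact code in this PR):

```python
import json
import os


def kv_cache_bytes_per_token(model_dir: str, dtype_bytes: int = 2) -> int:
    """Per-token KV cache size in bytes, accounting for GQA when present."""
    with open(os.path.join(model_dir, "config.json")) as f:
        config = json.load(f)

    num_layers = config["num_hidden_layers"]
    num_attention_heads = config["num_attention_heads"]
    head_dim = config["hidden_size"] // num_attention_heads

    # GQA models keep only num_key_value_heads K/V heads per layer; models
    # without GQA omit the field, so default it to num_attention_heads.
    num_key_value_heads = config.get("num_key_value_heads", num_attention_heads)

    return 2 * num_layers * num_key_value_heads * head_dim * dtype_bytes
```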
Is my understanding correct?
@singularity-s0 Yes, you are right.
This PR has been updated with changes to how kv_cache_size is calculated. Please review.