Accelerate LLaMA model loading
This PR accelerates LLaMA model weight loading with safetensors. I find that the current weight-loading implementation roughly doubles the time cost as the tensor-model-parallelism degree increases (see the loading-time table below for LLaMA-65B).
| Parallelism Degree | Original (minutes) | Safetensors (minutes) |
|---|---|---|
| 1 | ~5 | ~5 |
| 2 | ~10 | ~5 |
| 4 | ~10 | ~5 |
I think it is ready for review. Code adapted from https://github.com/huggingface/text-generation-inference/blob/v0.8.2/server/text_generation_server/models/flash_llama.py#L206
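For context, the core idea is that safetensors checkpoints can be memory-mapped and sliced lazily, so each tensor-parallel rank reads only its own shard instead of deserializing the full checkpoint. Below is a minimal sketch of that pattern (illustrative only, not the exact diff in this PR; the function and parameter names are made up):

```python
# Sketch: with safetensors, each tensor-parallel rank can open the checkpoint
# and read only its own shard of each weight, instead of loading the whole
# .bin file on every rank.
import torch
from safetensors import safe_open

def load_column_parallel_weight(path: str, name: str,
                                rank: int, world_size: int) -> torch.Tensor:
    """Load only this rank's slice of a column-parallel weight (hypothetical helper)."""
    with safe_open(path, framework="pt", device="cpu") as f:
        weight_slice = f.get_slice(name)           # lazy handle, nothing read yet
        full_dim = weight_slice.get_shape()[0]     # output dim being sharded
        shard = full_dim // world_size
        start, end = rank * shard, (rank + 1) * shard
        return weight_slice[start:end]             # only this shard is read from disk
```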
Oh sorry, didn't mean to do that. :P
@AlpinDale Can we merge this? Model loading is currently extremely slow.
@JF-D could you please add some comments to your changes? A tad hard to read them at the moment :grimacing:
Resolved conflicts for reference.
@zhuohan123 I think it's possible, and I've updated the hf_model_weights_iterator function. Maybe you can review it.
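A rough sketch of what a safetensors-backed hf_model_weights_iterator can look like (illustrative; the actual signature and file discovery in the branch may differ) — it yields (name, tensor) pairs lazily so callers can pick out and shard only the weights they need:

```python
# Sketch: iterate over all *.safetensors shards in a model directory and yield
# (parameter name, tensor) pairs lazily, one tensor at a time.
import glob
import os
from typing import Iterator, Tuple

import torch
from safetensors import safe_open

def hf_model_weights_iterator(model_dir: str) -> Iterator[Tuple[str, torch.Tensor]]:
    for st_file in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
        with safe_open(st_file, framework="pt", device="cpu") as f:
            for name in f.keys():
                yield name, f.get_tensor(name)
```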