
Accelerate LLaMA model loading

Open JF-D opened this issue 1 year ago • 1 comments

This PR accelerates LLaMA model weight loading by using safetensors. I found that the current weight-loading implementation roughly doubles the loading time once tensor-model parallelism is used (see the loading-time table below for LLaMA-65B).

| Parallelism Degree | Original (minutes) | Safetensors (minutes) |
|---|---|---|
| 1 | ~5 | ~5 |
| 2 | ~10 | ~5 |
| 4 | ~10 | ~5 |

I think it is ready for review. Code adapted from https://github.com/huggingface/text-generation-inference/blob/v0.8.2/server/text_generation_server/models/flash_llama.py#L206
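For context, the core idea is that safetensors files can be memory-mapped and sliced lazily, so each tensor-parallel rank reads only its own shard of a weight instead of deserializing the full checkpoint. Below is a minimal sketch of per-rank slicing with the `safetensors` API; the file name, tensor name, and sharding dimension are illustrative, not the actual code in this PR:

```python
import torch
from safetensors import safe_open

def load_sharded_tensor(path: str, name: str, rank: int, world_size: int,
                        shard_dim: int = 0) -> torch.Tensor:
    """Read only this rank's slice of one tensor from a .safetensors file."""
    with safe_open(path, framework="pt", device="cpu") as f:
        tensor_slice = f.get_slice(name)       # lazy view, nothing loaded yet
        size = tensor_slice.get_shape()[shard_dim]
        assert size % world_size == 0
        block = size // world_size
        start, end = rank * block, (rank + 1) * block
        if shard_dim == 0:
            return tensor_slice[start:end]     # materializes only this shard
        return tensor_slice[:, start:end]

# Illustrative usage: rank 1 of 4 loads its slice of a column-parallel weight.
# weight = load_sharded_tensor("model-00001-of-00002.safetensors",
#                              "model.layers.0.mlp.gate_proj.weight",
#                              rank=1, world_size=4, shard_dim=0)
```

Because each rank only reads the bytes it needs, the load time should stay roughly flat as the tensor-parallel degree grows, which matches the table above.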

JF-D avatar Jun 25 '23 07:06 JF-D

Oh sorry, didn't mean to do that. :P

AlpinDale avatar Jun 28 '23 17:06 AlpinDale

@AlpinDale Can we merge this? Model loading is currently extremely slow.

lucasjinreal avatar Jul 12 '23 05:07 lucasjinreal

@JF-D could you please add some comments to your changes? A tad hard to read them at the moment :grimacing:

creatorrr avatar Jul 18 '23 08:07 creatorrr

Resolved the merge conflicts, for reference.

JF-D avatar Jul 19 '23 12:07 JF-D

@zhuohan123 I think it's possible, and I've updated the hf_model_weights_iterator function. Maybe you can review it.
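For reviewers, the shape of the change is roughly an iterator that walks the checkpoint's `.safetensors` shards and yields `(name, tensor)` pairs, which the model loader consumes the same way it did for the old `.bin` files. A rough sketch of that pattern (paths and the function name here are illustrative, not the exact code in the PR):

```python
import glob
import os
from typing import Iterator, Tuple

import torch
from safetensors import safe_open

def safetensors_weights_iterator(
        model_dir: str) -> Iterator[Tuple[str, torch.Tensor]]:
    """Yield (parameter name, tensor) pairs from every .safetensors shard."""
    for path in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
        with safe_open(path, framework="pt", device="cpu") as f:
            for name in f.keys():
                yield name, f.get_tensor(name)
```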

JF-D avatar Aug 05 '23 07:08 JF-D