Accelerate LLaMA model loading
This PR accelerates LLaMA model weight loading with safetensors. I find that the current weight-loading implementation roughly doubles the time cost as the tensor-model-parallelism degree increases (see the loading-time table below for LLaMA-65B).
| Parallelism Degree | Original (minutes) | Safetensors (minutes) |
|---|---|---|
| 1 | ~5 | ~5 |
| 2 | ~10 | ~5 |
| 4 | ~10 | ~5 |
I think it is ready for review. Code adapted from https://github.com/huggingface/text-generation-inference/blob/v0.8.2/server/text_generation_server/models/flash_llama.py#L206
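For context, the core idea is that safetensors checkpoints can be memory-mapped and sliced lazily, so each tensor-parallel rank reads only its own shard instead of deserializing the full checkpoint. Below is a minimal sketch of that pattern (illustrative only, not the exact diff in this PR; the function and parameter names are made up):

```python
# Sketch: with safetensors, each tensor-parallel rank can open the checkpoint
# and read only its own shard of each weight, instead of loading the whole
# .bin file on every rank.
import torch
from safetensors import safe_open

def load_column_parallel_weight(path: str, name: str,
                                rank: int, world_size: int) -> torch.Tensor:
    """Load only this rank's slice of a column-parallel weight (hypothetical helper)."""
    with safe_open(path, framework="pt", device="cpu") as f:
        weight_slice = f.get_slice(name)           # lazy handle, nothing read yet
        full_dim = weight_slice.get_shape()[0]     # output dim being sharded
        shard = full_dim // world_size
        start, end = rank * shard, (rank + 1) * shard
        return weight_slice[start:end]             # only this shard is read from disk
```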
Oh sorry, didn't mean to do that. :P
@AlpinDale Can we merge this? Model loading is currently extremely slow.
@JF-D could you please add some comments to your changes? A tad hard to read them at the moment :grimacing:
Resolved conflicts for reference.
@zhuohan123 I think it's possible, and I've updated the hf_model_weights_iterator function. Maybe you can review it.
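A rough sketch of what a safetensors-backed hf_model_weights_iterator can look like (illustrative; the actual signature and file discovery in the branch may differ) — it yields (name, tensor) pairs lazily so callers can pick out and shard only the weights they need:

```python
# Sketch: iterate over all *.safetensors shards in a model directory and yield
# (parameter name, tensor) pairs lazily, one tensor at a time.
import glob
import os
from typing import Iterator, Tuple

import torch
from safetensors import safe_open

def hf_model_weights_iterator(model_dir: str) -> Iterator[Tuple[str, torch.Tensor]]:
    for st_file in sorted(glob.glob(os.path.join(model_dir, "*.safetensors"))):
        with safe_open(st_file, framework="pt", device="cpu") as f:
            for name in f.keys():
                yield name, f.get_tensor(name)
```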