
Distributing LLaMA on multiple machines within the same network

Open fabawi opened this issue 1 year ago • 5 comments

Using torch.distributed and fairscale, LLaMA can be parallelized across multiple devices or machines, and that already works quite well. However, each GPU is expected to have a large amount of VRAM, since the weights are loaded onto all of them. I've seen quite a few workarounds: some offload the model partially or entirely to the CPU, while others reduce the weight precision. Loading the weights on a meta device could also reduce the burden on each GPU by materializing the model only once the weights are set for each layer. Then again, this only helps when loading weights, so you wouldn't run out of memory on initialization. Most approaches, if not all, as far as I can tell, assume the model weights are loaded on every GPU, at least initially.
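For reference, here is a minimal sketch of what meta-device initialization looks like in recent PyTorch. This is a generic illustration, not code from the repo linked below; the layer and the checkpoint file name are placeholders.

```python
import torch
import torch.nn as nn

# Create the module on the meta device: parameter shapes exist,
# but no real memory is allocated yet.
with torch.device("meta"):
    block = nn.Linear(4096, 4096)  # stand-in for one Transformer block

# Materialize the layer only once its weights arrive (assign=True needs
# PyTorch >= 2.1). "block_shard.pt" is a hypothetical per-layer checkpoint.
state_dict = torch.load("block_shard.pt", map_location="cpu")
block.load_state_dict(state_dict, assign=True)
block = block.to("cuda:0")
```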

To solve this issue, I developed a LLaMA variant distributed across multiple machines and GPUs using Wrapyfi (https://github.com/fabawi/wrapyfi). The outputs of the Transformer blocks are split (similar to fairscale pipelines, but more controllable) and transmitted through ZeroMQ. The performance seems better than the variants running on CPU, and it should be more accurate than the 8-bit variants (I haven't verified the latter; this is purely based on what their developers state). I tried the approach on 7B and 13B, and in theory it should work on the larger models. I will try it on the larger variants soon, but until then, I would appreciate feedback on what works and what doesn't.

https://github.com/modular-ml/wrapyfi-examples_llama
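For intuition, here is a rough sketch of the kind of hand-off involved: one machine runs its share of the Transformer blocks, serializes the intermediate activations, and sends them to the next machine over ZeroMQ. This is a generic illustration of the idea, not the Wrapyfi API; the socket setup, addresses, and the `first_half`/`second_half` stand-ins are assumptions.

```python
import io

import torch
import zmq

def send_activation(sock: zmq.Socket, h: torch.Tensor) -> None:
    # Serialize the activation tensor and push it to the next stage.
    buf = io.BytesIO()
    torch.save(h.cpu(), buf)
    sock.send(buf.getvalue())

def recv_activation(sock: zmq.Socket, device: str) -> torch.Tensor:
    # Receive the serialized tensor and place it on the local GPU.
    return torch.load(io.BytesIO(sock.recv()), map_location=device)

# Machine A (runs the first half of the blocks):
#   ctx = zmq.Context(); sock = ctx.socket(zmq.PUSH)
#   sock.connect("tcp://machine-b:5555")
#   send_activation(sock, first_half(tokens))
#
# Machine B (runs the second half):
#   ctx = zmq.Context(); sock = ctx.socket(zmq.PULL)
#   sock.bind("tcp://*:5555")
#   logits = second_half(recv_activation(sock, "cuda:0"))
```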

fabawi avatar Mar 10 '23 22:03 fabawi

It's great!!

We'd like to know how to run the larger models, 13B or 65B. Thanks!

yokie121 avatar Mar 13 '23 03:03 yokie121

Thanks @yokie121! Check out the example in the repo's readme, under "Running 7B on 4 machines".

The same applies to larger models: if the 2-machine variant works, and then the 4-machine/GPU-device variant works, it should work on larger models as long as you have sufficient VRAM. To run 13B, do not change nproc_per_node; it is always 1 with our version of LLaMA. Instead, change the model location to 13B and adjust --wrapyfi_device_idx and --wrapyfi_total_devices.
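As an illustration only, a 13B run split across two machines would look roughly like the following. Apart from --wrapyfi_device_idx and --wrapyfi_total_devices, the entry point and the other flag names are assumptions borrowed from the standard LLaMA example; check the repo's readme for the exact invocation.

```bash
# Machine 0 (hypothetical command; flags other than the two wrapyfi ones are assumed)
torchrun --nproc_per_node 1 example.py \
    --ckpt_dir <CHECKPOINT_DIR>/13B --tokenizer_path <TOKENIZER_DIR>/tokenizer.model \
    --wrapyfi_device_idx 0 --wrapyfi_total_devices 2

# Machine 1
torchrun --nproc_per_node 1 example.py \
    --ckpt_dir <CHECKPOINT_DIR>/13B --tokenizer_path <TOKENIZER_DIR>/tokenizer.model \
    --wrapyfi_device_idx 1 --wrapyfi_total_devices 2
```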

fabawi avatar Mar 13 '23 07:03 fabawi

Hello @fabawi, can you please tell us how you loaded the model across multiple GPUs and fine-tuned it?

leo-a11 avatar Jun 09 '23 12:06 leo-a11

Any updates?

Negashev avatar Nov 10 '23 18:11 Negashev

In the meantime, the Distributed Llama project was created (by me 😅).

b4rtaz avatar Jan 20 '24 21:01 b4rtaz