
Add multi-GPU support to train

Open stuxnet147 opened this issue 1 year ago • 4 comments

Hello

I'm using 4 GPUs, but it looks like only one GPU is actually being used during training.

If possible, I would appreciate it if you could add a feature that allows training on multiple GPUs.


stuxnet147 avatar Apr 12 '23 21:04 stuxnet147

This would be a killer feature... I agree

practical-dreamer avatar Apr 13 '23 02:04 practical-dreamer

I suggest using the training script in https://github.com/tloen/alpaca-lora directly. Multi-GPU requires torchrun, which is a multiprocess setup that is too hard to manage from a webui. You should use a script instead.
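For context, this is roughly what torchrun manages per process — a minimal sketch assuming plain PyTorch DDP (the actual alpaca-lora script wires this up through transformers/peft, so treat this as illustration only):

```python
# Each process spawned by torchrun gets its own rank and binds one GPU.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, etc. for every process
    local_rank = int(os.environ["LOCAL_RANK"])
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(10, 10).cuda(local_rank)  # stand-in for the real LoRA model
    model = DDP(model, device_ids=[local_rank])
    # ... training loop: each process trains on its own shard of the data ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
    # launch with e.g.: torchrun --nproc_per_node=4 this_script.py
```

Managing several of these processes (spawning, monitoring, killing them) from inside a single Gradio session is what makes it awkward.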

sgsdxzy avatar Apr 13 '23 04:04 sgsdxzy

I've been intending to figure out how to get this working in the webui, but the limitation is that I don't currently have a multi-GPU setup to test with.

mcmonkey4eva avatar Apr 14 '23 07:04 mcmonkey4eva

I suggest using the training script in https://github.com/tloen/alpaca-lora directly. Multi-GPU requires torchrun, which is a multiprocess setup that is too hard to manage from a webui. You should use a script instead.

Couldn't we just make the webui manage a torchrun / DeepSpeed process? Or, hell, just have the webui launch the script…
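Something like this could work — a rough, untested sketch of a UI handler handing training off to an external launcher (the script name and arguments below are placeholders, not the actual text-generation-webui or trainer API):

```python
# Hypothetical: spawn a multi-GPU training run from a webui callback instead
# of training in-process.
import subprocess

def start_multi_gpu_training(num_gpus: int, config_path: str) -> subprocess.Popen:
    cmd = [
        "accelerate", "launch",
        "--num_processes", str(num_gpus),
        "train_lora.py",          # placeholder training script
        "--config", config_path,  # placeholder argument
    ]
    # Run detached from the UI thread; progress could be read from the
    # process's stdout or a log file and streamed back into the interface.
    return subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
```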

practical-dreamer avatar May 03 '23 21:05 practical-dreamer

I suggest using the training script in https://github.com/tloen/alpaca-lora directly. Multi-GPU requires torchrun, which is a multiprocess setup that is too hard to manage from a webui. You should use a script instead.

Just wanted to quickly update this... some colleagues and I have managed to get distributed data parallel working by using accelerate to launch both the tloen alpaca trainer and axolotl. We observed a huge increase in performance (and temperatures) from working both GPUs at once. In one instance the ETA dropped from 110 hours to 40 hours for a 2048-context llama-7b LoRA finetune split across two 3090s with NVLink.

It should be noted that this requires significantly more VRAM, since the micro_batch_size is loaded onto each device instead of being split across them... Also, while we tried to keep the hyperparameters consistent in our comparison, it is possible we missed something... Finally, this performance uplift is likely multi-GPU exclusive, since accelerate lets us distribute the training...
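For anyone curious, a minimal sketch of the pattern accelerate uses when a trainer is launched with `accelerate launch` (toy model and data, not our actual runs) — it also shows why the full micro batch lands on every GPU:

```python
# Under `accelerate launch`, each GPU runs its own copy of this script on its
# own data shard; gradients are synchronized at backward time.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(16, 2)                       # stand-in for the LoRA model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = TensorDataset(torch.randn(64, 16), torch.randint(0, 2, (64,)))
loader = DataLoader(data, batch_size=4)              # per-device micro batch, not split

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:                                  # each process sees a different shard
    loss = torch.nn.functional.cross_entropy(model(x), y)
    accelerator.backward(loss)                       # gradients all-reduced across GPUs here
    optimizer.step()
    optimizer.zero_grad()

# effective batch size = micro_batch_size * num_processes * gradient_accumulation_steps
```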

Regardless, I just wanted to post my observations... feel free to view the runs on wandb:

Ooba's TextGen Trainer (non-distributed parallel): https://wandb.ai/vicunlocked/VicUnlocked-7b/runs/lv8xluf7?workspace=user-practicaldreamer

Axolotl launched through accelerate (distributed parallel): https://wandb.ai/vicunlocked/VicUnlocked-7b/runs/cms3bb81?workspace=user-practicaldreamer

practical-dreamer avatar May 16 '23 10:05 practical-dreamer

super agree

choigawoon avatar Jun 02 '23 08:06 choigawoon

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.

github-actions[bot] avatar Sep 03 '23 23:09 github-actions[bot]

Has this been pushed to the repo yet? I would like to use multiple GPUs to train.

norton-chris avatar Sep 25 '23 04:09 norton-chris