
Getting stuck on multi-GPU training

QinlongHuang opened this issue 1 year ago • 6 comments

When I was trying to train on multiple GPUs, I used OMP_NUM_THREADS=4 WORLD_SIZE=2 CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py, following #8. The output looks like:

Loading checkpoint shards: 100%|██████████| 2/2 [00:06<00:00,  3.13s/it]
Found cached dataset json (/home/huangqinlong/.cache/huggingface/datasets/json/default-b3d942d0bd09abdb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████| 1/1 [00:00<00:00, 754.10it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /home/huangqinlong/.cache/huggingface/datasets/json/default-b3d942d0bd09abdb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-f5849b60215c1f02.arrow and /home/huangqinlong/.cache/huggingface/datasets/json/default-b3d942d0bd09abdb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-34eeb7d54c9d1847.arrow
Loading checkpoint shards: 100%|██████████| 2/2 [00:08<00:00,  4.02s/it]
Map:  13%|█▎        | 6212/49672 [00:01<00:11, 3754.93 examples/s]
Found cached dataset json (/home/huangqinlong/.cache/huggingface/datasets/json/default-b3d942d0bd09abdb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
100%|██████████| 1/1 [00:00<00:00, 739.34it/s]
trainable params: 4194304 || all params: 6742609920 || trainable%: 0.06220594176090199
Loading cached split indices for dataset at /home/huangqinlong/.cache/huggingface/datasets/json/default-b3d942d0bd09abdb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-f5849b60215c1f02.arrow and /home/huangqinlong/.cache/huggingface/datasets/json/default-b3d942d0bd09abdb/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51/cache-34eeb7d54c9d1847.arrow

But then it gets stuck there. When I run nvidia-smi, both GPUs show 100% utilization and about 8300 MB of memory usage each (I used the llama-7b model). There are two 4090s on my machine, with CUDA 11.8 installed.

When I force single-GPU training, it works fine.
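To narrow things down, a minimal NCCL all_reduce check (independent of finetune.py; the file name and launch command below are only illustrative) can confirm whether collective communication itself is what hangs:

```python
# nccl_test.py -- minimal DDP all_reduce sanity check (illustrative, not part of alpaca-lora)
# Launch with: torchrun --nproc_per_node=2 nccl_test.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR/PORT and LOCAL_RANK for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor of ones; after all_reduce the value should
    # equal the world size. If NCCL P2P transport is broken, this call hangs.
    x = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    print(f"rank {dist.get_rank()}: all_reduce result = {x.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```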

So how can I solve it?

QinlongHuang avatar Apr 02 '23 07:04 QinlongHuang

Did you solve it? I am having the same issue.

imraviagrawal avatar Apr 03 '23 18:04 imraviagrawal

Having the exact same issue with 4x 4090 on a 7b training test. If I expose 1 GPU it works, but with 4 GPUs using the same torchrun command as @QinlongHuang, everything is stuck with no progress output. Wandb doesn't even start. Both GPU and CPU are at 100%.

Qubitium avatar Apr 04 '23 05:04 Qubitium

NOT totally solved. I found the problem was due to the NCCL backend trying to use peer-to-peer (P2P) transport. So setting the environment variable NCCL_P2P_DISABLE=1 or NCCL_P2P_LEVEL=2 might fix the issue, i.e. a command like NCCL_P2P_DISABLE=1 torchrun --nproc_per_node=2 --master_port=1234 finetune.py
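If you would rather not pass the variable on every launch, it can also be set from inside the training script, as long as this happens before any distributed/NCCL initialization. This is just a sketch of the idea, not something finetune.py does by default:

```python
# Put this at the very top of the training script, before torch.distributed /
# NCCL is initialized, so the variable is visible when NCCL starts up.
import os

# setdefault keeps an explicit value from the command line if one was given.
os.environ.setdefault("NCCL_P2P_DISABLE", "1")
```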

However, it is worth noting that disabling NCCL P2P may lower training speed, according to [1].

For instance, with the default settings (batch_size=128, micro_batch_size=4), I get ~14 it/s (1164 iters for one epoch) on a single 4090. When I switch to a dual-4090 setup with batch_size=256, micro_batch_size=4 (so that batch_size / num_of_gpus / micro_batch_size matches the single-GPU setting above), I also get ~14 it/s (582 iters for one epoch).

So it seems that we do not need to worry about the slowdown mentioned in [1] (at least in this project).
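The numbers line up with the effective-batch arithmetic; the helper below is only my paraphrase of how finetune.py derives its gradient accumulation steps (names and layout are illustrative):

```python
# Per-GPU work per optimizer step stays constant across the two settings above.
def grad_accum_steps(batch_size: int, micro_batch_size: int, world_size: int) -> int:
    # accumulation steps per optimizer step, split across the DDP ranks
    return batch_size // micro_batch_size // world_size


print(grad_accum_steps(128, 4, 1))  # single 4090: 32 accumulation steps per GPU
print(grad_accum_steps(256, 4, 2))  # dual 4090:   32 accumulation steps per GPU
# Each GPU does the same work per step, hence the similar it/s; with twice the
# global batch size, the total number of optimizer steps halves (1164 -> 582).
```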

System info:
OS: Ubuntu 20.04
CPU: AMD Ryzen 9 7950X
GPU: 2x 4090
Driver: 520.61.05
CUDA: 11.8

[1] https://discuss.pytorch.org/t/ddp-training-on-rtx-4090-ada-cu118/168366/8

QinlongHuang avatar Apr 04 '23 07:04 QinlongHuang

@QinlongHuang I can confirm NCCL_P2P_DISABLE=1 fixed the issue with my 4x 4090 on EPYC! Also, the NVIDIA driver with CUDA 12.1 has not yet fixed this NCCL issue on the 4090.

Qubitium avatar Apr 04 '23 08:04 Qubitium


Glad to hear that! According to the NVIDIA forums [1], they just fixed the P2P bug in the latest NVIDIA driver, 525.105.17 (the 530.x driver has not been updated yet), at the application layer, which means P2P transmission is effectively locked out.

So you can also just update your NVIDIA driver.

[1] https://forums.developer.nvidia.com/t/standard-nvidia-cuda-tests-fail-with-dual-rtx-4090-linux-box/233202/43

QinlongHuang avatar Apr 04 '23 10:04 QinlongHuang

I had this exact same issue on A40 GPUs as well. None of the above fixes worked for me, though.

As I commented in https://github.com/tloen/alpaca-lora/issues/3#issuecomment-1535609125, the solution for me was simply to not use torchrun and to run the Python script without it. No idea why it worked, but it seems to use DDP fine that way.
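For context, here is a rough paraphrase (not a verbatim excerpt) of how an alpaca-lora-style finetune.py decides on device placement: torchrun sets WORLD_SIZE, which switches the script onto the per-process DDP path that exercises NCCL, while a plain python launch leaves WORLD_SIZE unset, so the NCCL path is never taken and the model is instead spread across the visible GPUs.

```python
# Sketch of the device-placement decision in an alpaca-lora-style finetune.py
# (my paraphrase of the logic, not the repo's exact code).
import os

world_size = int(os.environ.get("WORLD_SIZE", 1))
ddp = world_size != 1  # torchrun sets WORLD_SIZE; `python finetune.py` does not

if ddp:
    # One process per GPU; NCCL handles gradient sync. This is the path that
    # hangs when P2P transport is broken.
    device_map = {"": int(os.environ.get("LOCAL_RANK", 0))}
else:
    # Single process; the model is sharded across visible GPUs by the loader,
    # so no NCCL collectives are involved.
    device_map = "auto"

print(device_map)
```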

jpgard avatar May 05 '23 02:05 jpgard