
Training with one GPU

enrico310786 opened this issue 2 years ago • 12 comments

Hi, as you explain in the README, you fine-tuned the pre-trained checkpoint on 8 A100 GPUs using the distributed package. Unfortunately I have just one GPU, so I cannot use the distributed package. I therefore changed the command from

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco

to

python train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco

Moreover, I set the number of workers to 0 in the create_loader function and set the --distributed arg in the main function to False, but I received the error "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group".

Could you give me some hints? Thanks

enrico310786 avatar Mar 31 '22 05:03 enrico310786

Have you tried setting nproc_per_node to 1?

python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco 

woctezuma avatar Mar 31 '22 08:03 woctezuma

Hi, no. I will try this option! If I use it, do I just leave the --distributed flag set to True and let the configuration handle the rest? I will let you know. Thanks

enrico310786 avatar Mar 31 '22 09:03 enrico310786

Hi, using your command and leaving --distributed=True, the training starts. Thanks. What are the training times for the train_retrieval fine-tuning on the COCO dataset? My setup uses a single Tesla T4 GPU, and the COCO training set contains on the order of 500k images. How long did one epoch take with 8 A100s on those 500k images? If I am able to use 4 GPUs, is it enough to set --nproc_per_node=4?

enrico310786 avatar Mar 31 '22 12:03 enrico310786

Sorry, I don't know the answers to these questions.

woctezuma avatar Mar 31 '22 13:03 woctezuma

What are the training times for the train_retrieval fine-tuning on the COCO dataset? How long did one epoch take with 8 A100s on 500k images?

Finetuning on COCO with 8 A100s takes a few hours.

LiJunnan1992 avatar Mar 31 '22 23:03 LiJunnan1992

Hi,

I tried training with 4 Tesla T4 GPUs, but I am getting these warnings and errors:

1. Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank-to-GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device. The same warning appears for the other ranks.

2. RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8

I am using torch==1.9.1+cu111 and torchvision==0.10.1+cu111. Which version of PyTorch did you use? Did you also set other options on your machine? For example, in https://github.com/PyTorchLightning/pytorch-lightning/issues/4420#issuecomment-919478212 they set export NCCL_IB_DISABLE=1 and export NCCL_P2P_DISABLE=1 on A100s.

enrico310786 avatar Apr 01 '22 04:04 enrico310786

I used PyTorch 1.10.

LiJunnan1992 avatar Apr 03 '22 14:04 LiJunnan1992

Hello, just use torchrun instead, e.g. torchrun train_caption.py.
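For example, assuming the same config and output directory as in the commands above, the single-GPU launch would look something like:

torchrun --nproc_per_node=1 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco

torchrun is equivalent to python -m torch.distributed.run in recent PyTorch releases (1.10+), so it accepts the same arguments.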

helleuch avatar Apr 22 '22 13:04 helleuch

I set the number of workers to 0 in the create_loader function and set the --distributed arg in the main function to False, but I received the error "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group".

For debugging, you can change the all_gather function so that it just returns the input tensor.

rocklee2022 avatar May 05 '23 03:05 rocklee2022

"Default process group has not been initialized, please make sure to call init_process_group"

If you are not using distributed training, changing these two functions so that they simply return their input (instead of calling torch.distributed) solves this problem. The specific functions are located in blip_retrieval.py; a sketch of such a change is shown below.
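A minimal sketch, assuming the two helpers are concat_all_gather and all_gather_with_grad as in the public BLIP blip_retrieval.py (verify the names against your copy of the file; the distributed branches below are illustrative, not the repository's exact code):

import torch
import torch.distributed as dist


def is_dist_avail_and_initialized():
    # True only when a default process group has actually been created.
    return dist.is_available() and dist.is_initialized()


@torch.no_grad()
def concat_all_gather(tensor):
    # Single-process run: nothing to gather, return the local tensor unchanged.
    if not is_dist_avail_and_initialized():
        return tensor
    tensors_gather = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(tensors_gather, tensor, async_op=False)
    return torch.cat(tensors_gather, dim=0)


class GatherLayer(torch.autograd.Function):
    # All-gather that also propagates gradients back to every rank.
    @staticmethod
    def forward(ctx, x):
        output = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(output, x)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        all_gradients = torch.stack(grads)
        dist.all_reduce(all_gradients)
        return all_gradients[dist.get_rank()]


def all_gather_with_grad(tensors):
    # Same guard: a single-process "all-gather" is just the tensor itself.
    if not is_dist_avail_and_initialized():
        return tensors
    return torch.cat(GatherLayer.apply(tensors), dim=0)

With these guards in place, a single-GPU run never calls into torch.distributed, so the "Default process group has not been initialized" error can no longer come from these helpers.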

shams2023 avatar Sep 28 '23 06:09 shams2023

Have you tried setting nproc_per_node to 1?

python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco 

Thank you for taking the time to look at my problem! When I was training train_caption.py on Windows, I set nproc_per_node=1 and default=False in the main function, but I still got

"RuntimeError: Default process group has not been initialized, please make sure to call init_process_group "

Is there anything else I need to set? Thank you very much for your help!

Y-HuiMing-Y avatar Jan 16 '24 08:01 Y-HuiMing-Y

I have the same question. May I ask if you have resolved it?

Fir1314 avatar Mar 31 '24 13:03 Fir1314