BLIP
Training with one GPU
Hi, as you explain in the README, you finetuned the pre-trained checkpoint on 8 A100 GPUs using the distributed package. Unfortunately I have just one GPU, so I cannot use the distributed package. I therefore changed the command from
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco
to
python train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco
Moreover, I set the number of workers to 0 in the create_loader function and set the --distributed arg in the main function to False, but I received the error "RuntimeError: Default process group has not been initialized, please make sure to call init_process_group".
Could you give me some hints? Thanks
Have you tried setting nproc_per_node to 1? That way the launcher still initializes the process group, just with a single worker:
python -m torch.distributed.run --nproc_per_node=1 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco
Hi, no. I will try this option! If I use it, do I just leave the --distributed flag set to True and let the configuration handle the rest? I will let you know. Thanks
Hi, using your command and leaving --distributed=True, training starts. Thanks. What are the training times for the train_retrieval finetuning on the COCO dataset? My setup uses a Tesla T4 GPU, and the COCO training set is on the order of 500k images. How long did one epoch take with 8 A100s on 500k images? If I am able to use 4 GPUs, is it enough to set --nproc_per_node=4?
Sorry, I don't know the answers to these questions.
Finetuning on COCO with 8 A100s takes a few hours.
Hi,
I tried training with 4 Tesla T4 GPUs, but I am getting these warnings and errors:
- Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank-to-GPU mapping is incorrect. Specify device_ids in barrier() to force use of a particular device. (The same warning appears for the other ranks.)
- RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
I am using torch==1.9.1+cu111 and torchvision==0.10.1+cu111. What version of torch did you use? Did you also set any other options on your machine? For example, here https://github.com/PyTorchLightning/pytorch-lightning/issues/4420#issuecomment-919478212 they set export NCCL_IB_DISABLE=1 and export NCCL_P2P_DISABLE=1 on A100s.
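For reference, the workaround from that issue would look like this before launching (the 4-GPU launch command just reflects my setup above):

export NCCL_IB_DISABLE=1
export NCCL_P2P_DISABLE=1
python -m torch.distributed.run --nproc_per_node=4 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco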
I used PyTorch 1.10.
Hello, just use torchrun instead, e.g. torchrun train_caption.py.
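A full single-GPU invocation would look something like this (the caption config path is an assumption based on the repo's configs folder):

torchrun --nproc_per_node=1 train_caption.py \
--config ./configs/caption_coco.yaml \
--output_dir output/caption_coco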
For debugging, you need to change the all_gather function to just return the input tensor.
"RuntimeError: Default process group has not been initialized, please make sure to call init_process_group"
If you are not using distributed training, changing the two gather functions (concat_all_gather and all_gather_with_grad) so that they simply return their input can solve this problem. The specific functions are located in blip_retrieval.py.
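A minimal sketch of that change, assuming the helper names used in blip_retrieval.py: both functions fall back to returning their input whenever no default process group has been initialized, so the same code runs on one GPU or many.

import torch
import torch.distributed as dist

def is_dist_avail_and_initialized():
    # True only when torch.distributed has an initialized default process group.
    return dist.is_available() and dist.is_initialized()

class GatherLayer(torch.autograd.Function):
    # All-gather that keeps gradients flowing back to each rank,
    # following the pattern used in blip_retrieval.py.
    @staticmethod
    def forward(ctx, x):
        output = [torch.zeros_like(x) for _ in range(dist.get_world_size())]
        dist.all_gather(output, x)
        return tuple(output)

    @staticmethod
    def backward(ctx, *grads):
        all_gradients = torch.stack(grads)
        dist.all_reduce(all_gradients)
        return all_gradients[dist.get_rank()]

def all_gather_with_grad(tensors):
    # Single process: gathering is a no-op, so return the input unchanged.
    if not is_dist_avail_and_initialized():
        return tensors
    return torch.cat(GatherLayer.apply(tensors), dim=0)

@torch.no_grad()
def concat_all_gather(tensor):
    # Single process: return the input unchanged.
    if not is_dist_avail_and_initialized():
        return tensor
    tensors_gather = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(tensors_gather, tensor, async_op=False)
    return torch.cat(tensors_gather, dim=0)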
Thank you for taking the time to look at my problem! When training train_caption.py on Windows, I set nproc_per_node=1 and the --distributed default to False in the main function, but I still got
"RuntimeError: Default process group has not been initialized, please make sure to call init_process_group"
Is there anything else I need to set? Thank you very much for your help!
I have the same question. May I ask if you have resolved it?