
About distributed training (CLIP-IQA)

Open lliee1 opened this issue 1 year ago • 2 comments

Hi, I appreciate your repo. I've been using the CLIP-IQA model in your repo for study purposes. It worked well in a single-GPU setting when I followed your simple training scripts.

I want to use a distributed setting (single node, num_gpu=2). I tried simply changing the argument num_gpu: 1 -> 2 in the YAML file, and I encountered device-mismatch errors in the LayerNorm and PromptLearner forward passes.

What solutions could I try for the distributed setting?
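If it helps, here is a minimal sketch of the kind of pattern I suspect can trigger this under nn.DataParallel (my own guess for illustration, not code from pyiqa): a submodule that keeps a plain tensor attribute on cuda:0 will still point at cuda:0 inside the replica running on cuda:1.

```python
import torch
import torch.nn as nn

class PromptLearnerLike(nn.Module):
    # Hypothetical module, only to illustrate the failure mode, not the real PromptLearner.
    def __init__(self, dim=8):
        super().__init__()
        # Plain tensor attribute, NOT registered as a parameter or buffer,
        # so DataParallel replication leaves it on cuda:0.
        self.ctx = torch.randn(4, dim, device="cuda:0")
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        # On the cuda:1 replica, x is on cuda:1 but self.ctx stays on cuda:0,
        # which raises an "Expected all tensors to be on the same device" style error.
        return self.norm(x + self.ctx.mean(0))

model = nn.DataParallel(PromptLearnerLike().cuda(), device_ids=[0, 1])
out = model(torch.randn(8, 8, device="cuda:0"))  # fails on the second GPU
```

On a single GPU this kind of code runs fine, which matches what I observed with num_gpu: 1.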

For reference, this is the command I used: `python pyiqa/train.py -opt options/train/CLIPIQA/train_CLIPIQA_koniq10k.yml`

Much appreciated, lliee

lliee1 avatar Apr 25 '24 05:04 lliee1

Thanks for your feedback. Training with DataParallel can sometimes be troublesome. It is better to use DistributedDataParallel, as recommended by the official PyTorch documentation.

You may git pull to get the latest version and use DistributedDataParallel with the following command:

torchrun --nproc_per_node=2 --master_port=4321 pyiqa/train.py -opt options/train/CLIPIQA/train_CLIPIQA_koniq10k.yml --launcher pytorch

In that case, please keep num_gpu: 1 in the yml config file.
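For reference, this is roughly what the pytorch launcher path does under the hood (a generic DDP sketch under my assumptions, not the exact code in pyiqa/train.py): torchrun starts one process per GPU and exports LOCAL_RANK, each process pins itself to a single device and wraps the model with DistributedDataParallel, so all tensors in one process live on one GPU and the DataParallel device-mismatch problem does not arise. This is also why the per-process num_gpu can stay at 1.

```python
# minimal_ddp_sketch.py -- generic example, not pyiqa's actual training code
# launch with: torchrun --nproc_per_node=2 minimal_ddp_sketch.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun for each process
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)            # one process <-> one GPU

    model = torch.nn.Linear(512, 1).cuda(local_rank)  # stand-in for the IQA model
    model = DDP(model, device_ids=[local_rank])

    x = torch.randn(16, 512, device=f"cuda:{local_rank}")
    loss = model(x).mean()
    loss.backward()                              # gradients all-reduced across processes

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```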

chaofengc avatar Apr 25 '24 13:04 chaofengc

Thank you for your reply!

I'm in the middle of something right now, so I'll leave a comment if I run into any problems.

lliee1 avatar Apr 30 '24 03:04 lliee1