
How to change the GPU ids and the number of GPUs used to train the model?

Huzhen757 opened this issue 3 years ago • 19 comments

Hello, I want to use the train_net script under the tools folder to train the yolof-res101-dc5-1x version of the network. Because the first card of my group's server is occupied by others, I want to use other cards for training, but I did not find where to set the GPU id in the setup script. So I changed the num_gpus, num_machines, and machine_rank parameters all to 1, but training still runs on GPU 0. How can I solve this?

Thanks !

Huzhen757 avatar May 14 '21 06:05 Huzhen757

You can specify GPU ids with CUDA_VISIBLE_DEVICES. For example, CUDA_VISIBLE_DEVICES=4,5,6,7 pods_train --num-gpus 4 will use the last 4 GPUs for training. You may need to adjust the warmup iterations and warmup factor when you use fewer GPUs for training.
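
To make the remapping concrete, here is a small illustrative sketch in plain PyTorch (not YOLOF/cvpods code): once CUDA_VISIBLE_DEVICES is set before CUDA is initialized, the selected physical GPUs are re-indexed as cuda:0, cuda:1, ... inside the process.

```python
# Illustrative sketch (plain PyTorch): CUDA_VISIBLE_DEVICES must be set before
# the CUDA context is created, i.e. before any torch.cuda call in the process.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"  # physical GPUs 4-7

import torch

if torch.cuda.is_available():
    print(torch.cuda.device_count())      # -> 4; they appear as cuda:0 ... cuda:3
    print(torch.cuda.get_device_name(0))  # the name of physical GPU 4
```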

chensnathan avatar May 14 '21 07:05 chensnathan

I added the statement os.environ['CUDA_VISIBLE_DEVICES'] = '0,1' in the train_net script. When running the train_net script, it reports an error: Default process group is not initialized. How can I solve it? Also, the default batch_size is 4; I am training with two 3090s, each with 24G of memory. How do I modify the batch size?

Huzhen757 avatar May 14 '21 08:05 Huzhen757

Oh, I see. I need to modify the IMS_PER_BATCH and IMS_PER_DEVICE parameters in the config script to change the batch_size. But for training on two 3090 cards, what should the WARMUP_FACTOR and WARMUP_ITERS parameters be changed to?

Huzhen757 avatar May 14 '21 08:05 Huzhen757

When you use two GPUs, the error Default process group is not initialized should not show up.

For changing WARMUP_FACTOR and WARMUP_ITERS:

WARMUP_ITERS = 1500 * 8 / NUM_GPUS
WARMUP_FACTOR = 1. / WARMUP_ITERS
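
For concreteness, a plain-Python sketch of that arithmetic for a 2-GPU run (the numbers just follow the formula above; nothing here is framework-specific):

```python
# Plain-Python sketch of the warmup scaling rule for a 2-GPU run.
NUM_GPUS = 2

WARMUP_ITERS = int(1500 * 8 / NUM_GPUS)  # 1500 iters at 8 GPUs -> 6000 at 2 GPUs
WARMUP_FACTOR = 1.0 / WARMUP_ITERS       # -> ~1.67e-4

print(WARMUP_ITERS, WARMUP_FACTOR)
```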

chensnathan avatar May 14 '21 09:05 chensnathan

I have now modified the corresponding parameters in the config script, but running the train_net script still reports an error: Default process group is not initialized

Huzhen757 avatar May 14 '21 10:05 Huzhen757

Traceback (most recent call last):
  File "train_net.py", line 106, in <module>
    launch(
  File "/media/data/huzhen/YOLOF-torch/cvpods/engine/launch.py", line 56, in launch
    main_func(*args)
  File "train_net.py", line 96, in main
    runner.train()
  File "/media/data/huzhen/YOLOF-torch/cvpods/engine/runner.py", line 270, in train
    super().train(self.start_iter, self.start_epoch, self.max_iter)
  File "/media/data/huzhen/YOLOF-torch/cvpods/engine/base_runner.py", line 84, in train
    self.run_step()
  File "/media/data/huzhen/YOLOF-torch/cvpods/engine/base_runner.py", line 185, in run_step
    loss_dict = self.model(data)
  File "/home/hz/anaconda3/envs/torch/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/media/data/huzhen/YOLOF-torch/playground/detection/coco/yolof/yolof_base/yolof.py", line 133, in forward
    losses = self.losses(
  File "/media/data/huzhen/YOLOF-torch/playground/detection/coco/yolof/yolof_base/yolof.py", line 211, in losses
    dist.all_reduce(num_foreground)
  File "/home/hz/anaconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 953, in all_reduce
    _check_default_pg()
  File "/home/hz/anaconda3/envs/torch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 210, in _check_default_pg
    assert _default_pg is not None,
AssertionError: Default process group is not initialized

Huzhen757 avatar May 14 '21 10:05 Huzhen757

Could you provide more details about your command for training?

chensnathan avatar May 14 '21 15:05 chensnathan

I am using the train_net script under the tools folder for training. Some parameters in the config script are adjusted, including IMS_PER_BATCH, IMS_PER_DEVICE, WARMUP_FACTOR, and WARMUP_ITERS. I also added an extra statement in the train_net script: os.environ['CUDA_VISIBLE_DEVICES'] = '0,1', and updated the dataset path in the base_dataset script. The other default parameters and hyper-parameters were not changed.

Huzhen757 avatar May 15 '21 02:05 Huzhen757

You need to add --num-gpus to your command when you train YOLOF. BTW, we recommend using pods_train as described in the README.
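
For background, here is an illustrative sketch of what a distributed launcher generally does for each worker (this is not cvpods' actual launch() code). It shows why dist.all_reduce in yolof.py fails with Default process group is not initialized when the launcher never runs:

```python
# Illustrative-only sketch of a single-machine distributed launcher;
# cvpods' real launch() differs in details.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(local_rank, world_size, dist_url):
    # Without this call, any dist.all_reduce (as in yolof.py's losses())
    # raises "Default process group is not initialized".
    dist.init_process_group(
        backend="nccl", init_method=dist_url,
        world_size=world_size, rank=local_rank,
    )
    torch.cuda.set_device(local_rank)
    # ... build the model, wrap it in DistributedDataParallel, run training ...


def launch_training(num_gpus, dist_url="tcp://127.0.0.1:29500"):
    # One process per GPU; each process gets its own rank.
    mp.spawn(_worker, nprocs=num_gpus, args=(num_gpus, dist_url))
```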

chensnathan avatar May 16 '21 08:05 chensnathan

Now there is a new error related to the dist_url parameter:

cvpods.engine.launch ERROR: Process group URL: tcp://127.0.0.1:50147
RuntimeError: Address already in use

Sigh... your code is really hard to get running.

Huzhen757 avatar May 16 '21 11:05 Huzhen757

Why not just follow the steps in the README? It should work well.
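
If the Address already in use error shows up again, it usually means the port in the process group URL is already taken by another job. A minimal sketch for picking a free port (standard library only; how the resulting URL is passed to the launcher depends on your setup, e.g. a dist_url argument):

```python
# Minimal sketch: ask the OS for an unused TCP port and build a process-group URL.
import socket


def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 -> the OS picks a free port
        return s.getsockname()[1]


dist_url = f"tcp://127.0.0.1:{find_free_port()}"
print(dist_url)
```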

chensnathan avatar May 17 '21 08:05 chensnathan

Training with the method in the README, I can only change the number of GPUs; it cannot change which GPU ids are used for training at all.

Huzhen757 avatar May 17 '21 09:05 Huzhen757

It can... I gave an example above.

You can specify GPU ids with CUDA_VISIBLE_DEVICES. For example, CUDA_VISIBLE_DEVICES=4,5,6,7 pods_train --num-gpus 4 will use the last 4 GPUs for training. You may need to adjust the warmup iterations and warmup factor when you use fewer GPUs for training.

chensnathan avatar May 17 '21 09:05 chensnathan

OK, I see. Training with 2 GPUs, it still reports an error:

assert base_world_size == 8, "IMS_PER_BATCH/DEVICE in config file is used for 8 GPUs"
AssertionError: IMS_PER_BATCH/DEVICE in config file is used for 8 GPUs

The number of GPUs required by your code is too large. My team only has 4 GPUs per machine; I don't think I can train it... sigh...
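
Reading that assertion message literally (this is an interpretation, not something confirmed in this thread), it checks that the config's IMS_PER_BATCH // IMS_PER_DEVICE still equals 8, i.e. the config values are written for 8 GPUs and are rescaled for the actual GPU count at launch time. A tiny sketch of the implied check, with hypothetical values:

```python
# Hypothetical reconstruction of the check implied by the assertion message;
# the real cvpods code and the official YOLOF config values may differ.
IMS_PER_BATCH = 64   # hypothetical total batch size written for 8 GPUs
IMS_PER_DEVICE = 8   # hypothetical per-GPU batch size

base_world_size = IMS_PER_BATCH // IMS_PER_DEVICE
assert base_world_size == 8, "IMS_PER_BATCH/DEVICE in config file is used for 8 GPUs"
```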

Huzhen757 avatar May 18 '21 03:05 Huzhen757

I am using 4 GPUs for training in the way you suggested, like this: CUDA_VISIBLE_DEVICES=0,1,2,3 pods_train --num-gpus 4

But it still reports an error: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370156314/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, invalid usage, NCCL version 2.7.8

How can I solve it? Thanks!

Huzhen757 avatar May 19 '21 11:05 Huzhen757

Many reasons can produce this error. You can refer to this solution and have a try.
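
One thing that sometimes helps narrow down NCCL "invalid usage" errors is checking what each environment actually exposes before launching. An illustrative diagnostic sketch:

```python
# Illustrative diagnostic: print the GPUs and NCCL version this environment sees.
import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPU count    =", torch.cuda.device_count())
print("NCCL version         =", torch.cuda.nccl.version())
for i in range(torch.cuda.device_count()):
    print(f"cuda:{i} ->", torch.cuda.get_device_name(i))
```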

chensnathan avatar May 20 '21 00:05 chensnathan

OK, I will try to see if I can work it out. Thanks!

Huzhen757 avatar May 20 '21 05:05 Huzhen757

This code is really hard to run.

xuyuyan123 avatar May 31 '21 07:05 xuyuyan123

This code is really hard to run.

Yes, it's hard to run. It is built on the cvpods library, so you have to install and compile that library, and then compile again inside this repo's source code. On top of that, it needs at least four GPUs to run, which is very demanding on hardware... I tried it with four 2080 Ti cards before and it still failed with the error above. It's tough; I don't want to train this code anymore. Honestly, the encoder part of the paper is worth learning from, but I can't be bothered to spend time on the rest... I still have to run my own experiments, sigh...

Huzhen757 avatar May 31 '21 07:05 Huzhen757