Arrebol comments

Results: 4 comments of Arrebol

Multi-gpu on a single node

And now it changes to:

> RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled cuda error, NCCL version 2.7.8
> Setting OMP and MKL num threads to 1.
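For anyone hitting the same "unhandled cuda error", a common first step is to rerun with NCCL's own logging turned on, since the PyTorch message alone rarely says which device or call failed. Below is a minimal sketch of building such an environment. `NCCL_DEBUG` and `CUDA_VISIBLE_DEVICES` are real environment variables; the helper name `nccl_debug_env` and the GPU ids are made up for illustration and are not part of this repository.

```python
import os


def nccl_debug_env(gpu_ids, base_env=None):
    """Return a copy of the environment with NCCL logging enabled.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Ask NCCL to print initialization and error details to stderr.
    env["NCCL_DEBUG"] = "INFO"
    # Restrict the run to the listed GPUs so a faulty device can be isolated.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


# Example: environment for a two-GPU run on a single node.
env = nccl_debug_env([0, 1], base_env={})
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the training command, and the NCCL lines in the output usually narrow down the failing step.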

Multi-gpu on a single node

> @Arrebol2020 Hello, have you solved it?

I didn't solve it, so I tried to implement DDP myself, and it seems to work.

What is the relationship between BASE_LR in the config file and the initial learning rate in the paper?

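> I completed training with the configuration in the code and found that the learning rate never went above 0.01, the value set by the BASE_LR parameter. Section 4.1 of the paper mentions an initial learning rate of 0.05. Should BASE_LR be changed to 0.05? Does the paper's initial learning rate refer to the learning rate of the first epoch, before warm-up, or to the maximum learning rate at epoch 6, after warm-up?

Hello, where is the learning-rate policy from this repository's config file actually used? I looked through the code and could not find anything that applies the policy, only its definition.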

What is the relationship between BASE_LR in the config file and the initial learning rate in the paper?

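> > > I completed training with the configuration in the code and found that the learning rate never went above 0.01, the value set by the BASE_LR parameter. Section 4.1 of the paper mentions an initial learning rate of 0.05. Should BASE_LR be changed to 0.05? Does the paper's initial learning rate refer to the learning rate of the first epoch, before warm-up, or to the maximum learning rate at epoch 6, after warm-up?
> >
> > Hello, where is the learning-rate policy from this repository's config file actually used? I looked through the code and could not find anything that applies the policy, only its definition.
>
> Do you mean the LR_POLICY=cos parameter? That should be used in [DOLG/core/optimizer.py]: in each epoch, the train_epoch() function sets the learning rate via `lr = optim.get_epoch_lr(cur_epoch)`.

Yes, right, I see it now. Thank you very much.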
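To make the original question concrete, here is a rough sketch of how a cosine policy with linear warm-up typically behaves. It is an assumption about the general technique, not a copy of the repository's code: the function name mirrors the `get_epoch_lr` call quoted above, BASE_LR = 0.01 is the value from the thread, and MAX_EPOCH, WARMUP_EPOCHS and WARMUP_FACTOR are placeholder values chosen for illustration.

```python
import math

BASE_LR = 0.01        # value reported in the thread
MAX_EPOCH = 100       # assumed for illustration
WARMUP_EPOCHS = 5     # assumed for illustration
WARMUP_FACTOR = 0.1   # assumed for illustration


def get_epoch_lr(cur_epoch):
    """Cosine decay from BASE_LR, scaled down linearly during warm-up."""
    # Half-period cosine: BASE_LR at epoch 0, approaching 0 at MAX_EPOCH.
    lr = 0.5 * BASE_LR * (1.0 + math.cos(math.pi * cur_epoch / MAX_EPOCH))
    if cur_epoch < WARMUP_EPOCHS:
        # Warm-up multiplier grows linearly from WARMUP_FACTOR towards 1.
        alpha = cur_epoch / WARMUP_EPOCHS
        lr *= WARMUP_FACTOR * (1.0 - alpha) + alpha
    return lr


schedule = [get_epoch_lr(epoch) for epoch in range(MAX_EPOCH)]
# Zero-based index of the epoch with the largest learning rate.
peak_epoch = max(range(MAX_EPOCH), key=schedule.__getitem__)
```

Under these assumptions the schedule starts at BASE_LR * WARMUP_FACTOR, peaks at the first epoch after warm-up, and never exceeds BASE_LR. That would match the observation that the learning rate tops out at about 0.01, so in a scheme like this BASE_LR acts as the post-warm-up maximum rather than the rate of the very first epoch.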