stylegan2-pytorch

autograd error in g_path_regularize in multi-GPUs training

Open KelestZ opened this issue 4 years ago • 7 comments

Hi, thanks for your efforts ;) I am trying to run the code on multiple GPUs on a single machine instead of using distributed training, to save time.

I then ran into an issue in g_path_regularize: RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

I tried several workarounds, but none of them worked well. I was wondering:

  1. Why doesn't distributed training have this issue?
  2. How can it be fixed? Should the outputs be reduced to a single tensor, as in WGAN-GP? Is there another way? Thanks in advance for your response.

KelestZ avatar May 06 '20 06:05 KelestZ

Did you use nn.DataParallel? g_path_regularize is not compatible with nn.DataParallel. (This is due to the multi-GPU gather operations in nn.DataParallel.)

rosinality avatar May 06 '20 08:05 rosinality
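
For context, the path-length regularizer in this repo's train.py is essentially the following (a paraphrased sketch; names follow the repo, but treat it as an illustration rather than the exact source). The autograd.grad call differentiates the generated images with respect to the latents returned by a separate generator forward pass, and it is this call that raises the error once those tensors have passed through nn.DataParallel's gather step.

```python
import math

import torch
from torch import autograd


# Paraphrased sketch of g_path_regularize from train.py.
def g_path_regularize(fake_img, latents, mean_path_length, decay=0.01):
    # Random image-space direction, scaled by 1/sqrt(H*W) so the projection
    # is independent of resolution.
    noise = torch.randn_like(fake_img) / math.sqrt(
        fake_img.shape[2] * fake_img.shape[3]
    )
    # Differentiate the projected output w.r.t. the latents; this is the call
    # that fails under nn.DataParallel.
    grad, = autograd.grad(
        outputs=(fake_img * noise).sum(), inputs=latents, create_graph=True
    )
    path_lengths = torch.sqrt(grad.pow(2).sum(2).mean(1))

    # Exponential moving average of the path-length target.
    path_mean = mean_path_length + decay * (path_lengths.mean() - mean_path_length)
    path_penalty = (path_lengths - path_mean).pow(2).mean()

    return path_penalty, path_mean.detach(), path_lengths
```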

Hi @rosinality, thanks for the great repository. Would you have any hints on how to make g_path_regularize work with nn.DataParallel?

beniz avatar May 20 '20 15:05 beniz

@beniz Maybe you can calculate g_path_regularize inside the generator's forward pass and return only the loss.

rosinality avatar May 20 '20 23:05 rosinality
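
A minimal sketch of that suggestion, assuming the generator's forward accepts return_latents=True as it does in this repo's train.py (the wrapper class PathRegGenerator and its interface are hypothetical): each replica computes the path-length penalty on its own sub-batch inside forward, so autograd.grad never sees tensors that have passed through nn.DataParallel's gather step, and only the resulting tensors are gathered.

```python
import math

import torch
from torch import autograd, nn


class PathRegGenerator(nn.Module):
    """Hypothetical wrapper: computes the path-length penalty inside forward()
    so that each nn.DataParallel replica differentiates its own local graph."""

    def __init__(self, generator):
        super().__init__()
        self.generator = generator

    def forward(self, noise, mean_path_length):
        # noise: list of latent-code tensors (as produced by mixing_noise in
        # train.py); DataParallel scatters each tensor along the batch dim.
        # mean_path_length: plain Python float, replicated to every GPU.
        fake_img, latents = self.generator(noise, return_latents=True)

        # Same penalty as g_path_regularize, evaluated on this replica only.
        img_noise = torch.randn_like(fake_img) / math.sqrt(
            fake_img.shape[2] * fake_img.shape[3]
        )
        grad, = autograd.grad(
            outputs=(fake_img * img_noise).sum(), inputs=latents, create_graph=True
        )
        path_lengths = torch.sqrt(grad.pow(2).sum(2).mean(1))
        path_mean = mean_path_length + 0.01 * (path_lengths.mean() - mean_path_length)
        path_penalty = (path_lengths - path_mean).pow(2).mean()

        return fake_img, path_penalty, path_lengths
```

With nn.DataParallel around this module, path_penalty comes back as one value per GPU (scalars are unsqueezed before gathering), so the training loop would take its mean before calling backward, and update mean_path_length from the gathered path_lengths, e.g. mean_path_length += 0.01 * (path_lengths.mean().item() - mean_path_length).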

> @beniz Maybe you can calculate g_path_regularize inside the generator's forward pass and return only the loss.

Thanks a lot for your effort! I encountered exactly the same issue when calling g_path_regularize with DataParallel. I would like to know if you have an implementation of the suggestion above.

lychenyoko avatar Jul 02 '20 16:07 lychenyoko

@lychenyoko I haven't tried it. Do you need DataParallel instead of DistributedDataParallel? (Because of Windows?)

rosinality avatar Jul 02 '20 23:07 rosinality

@rosinality Never mind, I have managed to do it. Thanks for your advice.

lychenyoko avatar Jul 03 '20 00:07 lychenyoko

@rosinality @lychenyoko Could you please let me know how you addressed that issue? I am using DistributedDataParallel right now but got stuck at the synchronization step (https://github.com/rosinality/stylegan2-pytorch/blob/bef283a1c24087da704d16c30abc8e36e63efa0e/train.py#L439).

ruinianxu avatar Jun 27 '23 15:06 ruinianxu