glow-pytorch

Broken parallelization

Open Svito-zar opened this issue 6 years ago • 4 comments

When I try to run the model on several GPUs I am getting a numerical error:

Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.

While running on a single GPU everything works just fine.

That indicates that there is an issue with parallelization.

Svito-zar, Jun 06 '19 12:06
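
One way to narrow down where values like the ones in this warning first appear is to check every module's output during the forward pass. The following is a minimal sketch, not code from this repository: `add_finite_checks` is a hypothetical helper, and the two-layer model is a stand-in for the real network. How the hooks behave under the repository's multi-GPU setup has not been verified here.

```python
import torch
from torch import nn


def add_finite_checks(model):
    """Attach forward hooks that raise as soon as a module outputs NaN or Inf.

    Hypothetical debugging helper. Returns the hook handles so the checks
    can be removed again with handle.remove().
    """
    def make_hook(name):
        def hook(module, inputs, output):
            # A module may return a single tensor or a tuple/list of values.
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for i, out in enumerate(outputs):
                if torch.is_tensor(out) and not torch.isfinite(out).all():
                    raise RuntimeError(
                        f"non-finite value in output {i} of module '{name}'"
                    )
        return hook

    return [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
    ]


if __name__ == "__main__":
    # Stand-in model; replace with the real network when debugging.
    model = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
    handles = add_finite_checks(model)
    model(torch.ones(2, 4))  # finite weights and inputs: no error
    for handle in handles:
        handle.remove()
```

Hooks on child modules run before the hook on their parent, so the error names the innermost module that produced the bad value. The hooks add overhead, so they are best removed once the offending layer is found.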

Hey Svito-zar, I am training on a single GPU, but it shows the same warning (Warning: NaN or Inf found in input tensor). Please guide me on how I should solve this problem.
Screenshot from 2019-07-19 14-14-50

zain-ul-abedien, Jul 19 '19 09:07

I didn't have your problem and I don't know how to fix it either. Would be interested to know the solution as well.

What I find weird from a machine learning perspective is that your batch_size is very small. That makes the gradient estimate vary a lot, which might lead to numerical instabilities. So I would try a much larger batch_size: at least 20, better 50.

Svito-zar, Jul 22 '19 07:07
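
The batch-size argument above can be illustrated with a toy simulation. This sketch assumes each per-sample "gradient" is an independent draw from a unit Gaussian, which is a simplification and not a model of Glow; `minibatch_mean_std` is a hypothetical name. Under that assumption the spread of the mini-batch average shrinks roughly as 1/sqrt(batch_size).

```python
import random
import statistics


def minibatch_mean_std(batch_size, n_batches=2000, seed=0):
    """Estimate how much a mini-batch average fluctuates between batches.

    Toy assumption: every per-sample gradient is an independent draw
    from a Gaussian with mean 0 and standard deviation 1.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_batches):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        means.append(statistics.fmean(batch))
    # Standard deviation of the batch averages across all simulated batches.
    return statistics.pstdev(means)


if __name__ == "__main__":
    for size in (2, 20, 50):
        print(size, round(minibatch_mean_std(size), 3))
```

A noisier gradient estimate does not by itself produce NaN or Inf, so this only shows why a small batch_size is a plausible contributor rather than a confirmed cause.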

I have the same problem with a large batch_size of 64. Have you guys found a solution? Help, please.

ueoo, Sep 03 '19 14:09

I found some problems with parallelization too. When I try to run the model on more than one GPU, the process just freezes at the forward stage, namely at this line in trainer.py:

z, nll, y_logits = self.graph(x=x, y_onehot=y_onehot)

The program is still running, but I can't see any output after this line. However, one GPU works fine.

pptrick, Feb 25 '21 12:02
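
When a process stalls without printing anything, as in the report above, Python's standard `faulthandler` module can dump the stack of every thread after a timeout, which shows where the program is stuck. The sketch below is a generic diagnostic, not a fix for this issue; `hang_watchdog` and `slow_forward` are hypothetical names, and `slow_forward` only stands in for a forward call that stalls.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s, file=None):
    """Dump all thread stacks every timeout_s seconds while the block runs."""
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Stop the watchdog once the wrapped block finishes or raises.
        faulthandler.cancel_dump_traceback_later()


def slow_forward(seconds):
    """Hypothetical stand-in for a forward call that stalls."""
    time.sleep(seconds)
    return "done"


if __name__ == "__main__":
    # The stall outlasts the timeout, so a traceback is written to stderr.
    with hang_watchdog(1.0):
        slow_forward(2.5)
```

In the trainer, the same context manager could wrap the `self.graph(x=x, y_onehot=y_onehot)` call with a timeout well above the normal step time. The dump shows Python-level frames only, so a stall inside native GPU code appears as the last Python line that entered it.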