glow-pytorch
Broken parallelization
When I try to run the model on several GPUs I am getting a numerical error:
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
Warning: NaN or Inf found in input tensor.
While running on a single GPU everything works just fine.
That indicates that there is an issue with parallelization.
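For what it's worth, I believe the "NaN or Inf found in input tensor" message is printed by the scalar logger when it receives a non-finite value, so the NaN most likely originates earlier, in the loss computation. A minimal sketch of how one might trace it (the graph/nll names are assumptions based on trainer.py, not code from this repo):

```python
import torch

def check_finite(name, tensor):
    # Fail fast if a tensor contains NaN/Inf, instead of letting the
    # logger print "NaN or Inf found in input tensor" much later.
    if not torch.isfinite(tensor).all():
        raise RuntimeError(f"{name} contains NaN/Inf")
    return tensor

# Hypothetical use inside the training loop (names assumed):
# z, nll, y_logits = graph(x=x, y_onehot=y_onehot)
# loss = check_finite("nll", nll).mean()
```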
Hey Svito-zar, I am training on a single GPU but I get the same warning (Warning: NaN or Inf found in input tensor). Please guide me on how to solve this problem.

I didn't have your problem, and I don't know how to fix it either. I would be interested to know the solution as well.
What I find weird from the machine learning perspective is that your batch_size is very small. It causes the gradient to vary a lot, which might lead to numerical instabilities. So I would try much larger batch_sizes: at least 20, better 50.
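If a larger batch_size alone does not help, gradient clipping is a common safeguard against this kind of instability. A minimal sketch (the graph, x, y_onehot names are assumed to match trainer.py, and the clipping threshold is an arbitrary choice):

```python
import torch

def training_step(graph, optimizer, x, y_onehot, max_grad_norm=5.0):
    # One optimization step with gradient-norm clipping to reduce the
    # chance of NaN/Inf losses when gradients are noisy.
    optimizer.zero_grad()
    z, nll, y_logits = graph(x=x, y_onehot=y_onehot)
    loss = nll.mean()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(graph.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```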
I have the same problem with a large batch_size of 64. Have you found a solution? Help, please.
I found some problems with parallelization too. When I try to run the model on more than one GPU, the process just freezes on the forward stage, namely at this line in trainer.py:
z, nll, y_logits = self.graph(x=x, y_onehot=y_onehot)
The program keeps running, but I never see any output after this line. However, running on one GPU works fine.
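A hang on the first multi-GPU forward, while the single-GPU run works, often points to the scatter/gather or GPU peer-to-peer communication layer rather than the model itself. Here is a minimal repro sketch with a toy module (this is my assumption of a DataParallel-style setup, not code from the repo); if even this hangs on your machine, the problem is in the environment (driver / inter-GPU communication) rather than in glow-pytorch:

```python
import torch
import torch.nn as nn

class Toy(nn.Module):
    # Stand-in for the Glow graph: returns a per-sample "nll" so that
    # DataParallel can gather the outputs along the batch dimension.
    def forward(self, x, y_onehot=None):
        nll = x.flatten(1).mean(dim=1)
        return x * 2, nll

if torch.cuda.device_count() > 1:
    model = nn.DataParallel(Toy()).cuda()
    x = torch.randn(8, 3, 32, 32).cuda()
    out, nll = model(x=x)  # if this call hangs too, the model is not the culprit
    print(nll.mean().item())
```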