pytorch-ts
unable to reproduce results from notebook
I am unable to reproduce the results from the TimeGrad notebook. The training loss diverges and eventually becomes NaN.
predictor = estimator.train(dataset_train, num_workers=8)
99it [00:22, 4.39it/s, avg_epoch_loss=0.945, epoch=0]
99it [00:22, 4.40it/s, avg_epoch_loss=0.495, epoch=1]
99it [00:22, 4.39it/s, avg_epoch_loss=0.466, epoch=2]
99it [00:22, 4.35it/s, avg_epoch_loss=0.795, epoch=3]
99it [00:22, 4.33it/s, avg_epoch_loss=0.852, epoch=4]
99it [00:22, 4.32it/s, avg_epoch_loss=nan, epoch=5]
99it [00:22, 4.33it/s, avg_epoch_loss=nan, epoch=6]
99it [00:22, 4.30it/s, avg_epoch_loss=nan, epoch=7]
99it [00:23, 4.30it/s, avg_epoch_loss=nan, epoch=8]
99it [00:22, 4.34it/s, avg_epoch_loss=nan, epoch=9]
99it [00:23, 4.29it/s, avg_epoch_loss=nan, epoch=10]
99it [00:23, 4.28it/s, avg_epoch_loss=nan, epoch=11]
99it [00:22, 4.33it/s, avg_epoch_loss=nan, epoch=12]
99it [00:23, 4.21it/s, avg_epoch_loss=nan, epoch=13]
99it [00:23, 4.30it/s, avg_epoch_loss=nan, epoch=14]
99it [00:23, 4.30it/s, avg_epoch_loss=nan, epoch=15]
99it [00:22, 4.34it/s, avg_epoch_loss=nan, epoch=16]
99it [00:22, 4.34it/s, avg_epoch_loss=nan, epoch=17]
99it [00:22, 4.34it/s, avg_epoch_loss=nan, epoch=18]
99it [00:23, 4.20it/s, avg_epoch_loss=nan, epoch=19]
Thanks for letting me know... perhaps I screwed something up while refactoring. I'll check and get back to you.
@turmeric-blend which parameters did you give the model? I remember that when I set beta_end too high I got NaNs...
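For context, a minimal sketch of how beta_end enters the notebook-style setup, assuming the pytorch-ts API as I understand it (the import paths, Trainer arguments, and every value except beta_end are illustrative assumptions, not the actual notebook configuration):

import torch
from pts import Trainer
from pts.model.time_grad import TimeGradEstimator

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Sketch only: all values besides beta_end are assumed, not taken from the notebook.
estimator = TimeGradEstimator(
    freq="H",                  # assumed dataset frequency
    prediction_length=24,      # assumed prediction horizon
    target_dim=370,            # assumed number of target series
    input_size=1484,           # assumed; depends on dataset and lag features
    diff_steps=100,            # assumed number of diffusion steps
    beta_end=0.07,             # value reported in this thread; large values reportedly cause NaNs
    trainer=Trainer(device=device, epochs=20, batch_size=64, num_batches_per_epoch=100),
)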
my beta_end=0.07, I ran everything the same as the notebook example.
thanks! I just re-ran it again on my machine and it all worked out... very strange... 🤔
I am running from the downloaded zip folder (didn't install via pip install pytorchts). Not sure if this affects anything.
Hmm, not sure... perhaps in the downloaded zip folder do pip install . and then try?
I feel like it's related to the random seed, since we are using different machines...
Also, which version of PyTorch do you use? I am using PyTorch 1.7.1 here.
pytorch 1.7.1+cu110
I tried with pip install ., but the NaN still occurs. Maybe you could update the results with a fixed seed for PyTorch, MXNet, NumPy, random, etc., and I will see if I can reproduce it?
Sure, I can try... I just re-ran it with the parameters from the paper and checked it in to the notebook; I also fixed the "cuda" device name...
My CUDA setting is like:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
but this is just because I don't have a GPU at index 1.
OK, so setting a seed, e.g. via:
import numpy as np
import torch

np.random.seed(123456)
torch.manual_seed(123456)
also works for me and I get no NaN in training... can you perhaps try training with fewer num_workers, e.g. 4 or 2 or even 0 (sketched below)?
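A minimal sketch combining the seeding above with the reduced num_workers suggestion; the extra random/CUDA seeding calls and the seed value are my additions, not something from the notebook:

import random
import numpy as np
import torch

# Fix the seeds before building and training the estimator.
random.seed(123456)
np.random.seed(123456)
torch.manual_seed(123456)
torch.cuda.manual_seed_all(123456)

# Same training call as above, just with fewer (here zero) DataLoader workers.
predictor = estimator.train(dataset_train, num_workers=0)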
Nothing works :/ I reinstalled everything in a clean env and it still doesn't work.
I installed gluonts via pip install git+https://github.com/awslabs/gluon-ts.git@master#egg=gluonts.
If I try to install PyTorchTS via pip install pytorchts, I get this error:
ERROR: Packages installed from PyPI cannot depend on packages which are not also hosted on PyPI. pytorchts depends on gluonts@ git+https://github.com/awslabs/gluon-ts.git@master#egg=gluonts
Sorry to hear that... I will try to reproduce on a clean env as well!
I get the same NaN results, and I found that the model outputs themselves are NaN. It is very strange.
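In case it helps to narrow down where the NaNs first appear, a small sketch using standard PyTorch tooling; the report_nans helper and the tensor names are placeholders of mine, not pytorch-ts internals:

import torch

# Raise an error at the first backward op that produces NaN/Inf gradients.
torch.autograd.set_detect_anomaly(True)

def report_nans(output: torch.Tensor, name: str = "output") -> None:
    # Check any tensor of interest (e.g. a forward-pass output) for NaNs.
    if torch.isnan(output).any():
        print(f"{name} contains {torch.isnan(output).sum().item()} NaN values")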
can you try to use the 0.7.0 branch?