pytorch-ts icon indicating copy to clipboard operation
pytorch-ts copied to clipboard

unable to reproduce results from notebook

Open turmeric-blend opened this issue 4 years ago • 17 comments

I am unable to reproduce results from TimeGrad Notebook. I am getting diverging loss into NaN loss.

predictor = estimator.train(dataset_train, num_workers=8)

99it [00:22, 4.39it/s, avg_epoch_loss=0.945, epoch=0] 99it [00:22, 4.40it/s, avg_epoch_loss=0.495, epoch=1] 99it [00:22, 4.39it/s, avg_epoch_loss=0.466, epoch=2] 99it [00:22, 4.35it/s, avg_epoch_loss=0.795, epoch=3] 99it [00:22, 4.33it/s, avg_epoch_loss=0.852, epoch=4] 99it [00:22, 4.32it/s, avg_epoch_loss=nan, epoch=5]
99it [00:22, 4.33it/s, avg_epoch_loss=nan, epoch=6] 99it [00:22, 4.30it/s, avg_epoch_loss=nan, epoch=7] 99it [00:23, 4.30it/s, avg_epoch_loss=nan, epoch=8] 99it [00:22, 4.34it/s, avg_epoch_loss=nan, epoch=9] 99it [00:23, 4.29it/s, avg_epoch_loss=nan, epoch=10] 99it [00:23, 4.28it/s, avg_epoch_loss=nan, epoch=11] 99it [00:22, 4.33it/s, avg_epoch_loss=nan, epoch=12] 99it [00:23, 4.21it/s, avg_epoch_loss=nan, epoch=13] 99it [00:23, 4.30it/s, avg_epoch_loss=nan, epoch=14] 99it [00:23, 4.30it/s, avg_epoch_loss=nan, epoch=15] 99it [00:22, 4.34it/s, avg_epoch_loss=nan, epoch=16] 99it [00:22, 4.34it/s, avg_epoch_loss=nan, epoch=17] 99it [00:22, 4.34it/s, avg_epoch_loss=nan, epoch=18] 99it [00:23, 4.20it/s, avg_epoch_loss=nan, epoch=19]

turmeric-blend avatar Feb 17 '21 08:02 turmeric-blend

thanks for letting me know.. perhaps i screwed something up while refactoring, I'll check and get back to you

kashif avatar Feb 17 '21 09:02 kashif

@turmeric-blend which parameters did you give the model? I remember that when I set the beta_end to be high I got nans...

kashif avatar Feb 17 '21 09:02 kashif

my beta_end=0.07, I ran everything the same as the notebook example.

turmeric-blend avatar Feb 17 '21 09:02 turmeric-blend

thanks! I just re-ran it again on my machine and it all worked out... very strange... 🤔

kashif avatar Feb 17 '21 09:02 kashif

I am running from the downloaded zip folder (didn't install via pip install pytorchts). Not sure if this affects anything.

turmeric-blend avatar Feb 17 '21 09:02 turmeric-blend

hmm not sure... perhaps in the downloaded zip folder do: pip install . and then try?

kashif avatar Feb 17 '21 09:02 kashif

I feel like its related to a random seed since we are using different machines....

turmeric-blend avatar Feb 17 '21 09:02 turmeric-blend

also which version of pytorch to you use? I am using pytorch 1.7.1 here

kashif avatar Feb 17 '21 09:02 kashif

pytorch 1.7.1+cu110

turmeric-blend avatar Feb 17 '21 09:02 turmeric-blend

I tried with pip install ., nan still occurs, maybe you could update the results with fixed seed for pytorch,mxnet,numpy,random ... etc, and I will see if I can reproduce it?

turmeric-blend avatar Feb 17 '21 09:02 turmeric-blend

I can try sure.. I just re-ran it again with the parameters from the paper and checked in the notebook, I also fixed the "cuda" device name...

kashif avatar Feb 17 '21 09:02 kashif

my cuda settings is like:

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

but this is just because I don't have a GPU on 1.

turmeric-blend avatar Feb 17 '21 09:02 turmeric-blend

ok so setting a seed e.g. via:

np.random.seed(123456)
torch.manual_seed(123456)

also works for me and I get no nan in training... can you try to perhaps train with less num_workers e.g. 4 or 2 or even 0?

kashif avatar Feb 17 '21 10:02 kashif

nothing works :/ I reinstalled everything in a clean env and still doesn't work.

I installed gluonts via pip install git+https://github.com/awslabs/gluon-ts.git@master#egg=gluonts.

If I try to install PyTorchTS via pip install pytorchts, I get this error:

ERROR: Packages installed from PyPI cannot depend on packages which are not also hosted on PyPI. pytorchts depends on gluonts@ git+https://github.com/awslabs/gluon-ts.git@master#egg=gluonts

turmeric-blend avatar Feb 18 '21 00:02 turmeric-blend

sorry to hear that... i will try to reproduce on a clean env as well!

kashif avatar Feb 18 '21 15:02 kashif

I have the same results NaN, and I found that the outputs are NaN. It is very strange.

wsdqy1234 avatar Jun 19 '23 08:06 wsdqy1234

can you try to use the 0.7.0 branch?

kashif avatar Jun 19 '23 08:06 kashif