
FWM - WT2 reproducibility

Open DavidHerel opened this issue 2 years ago • 8 comments

Hi there,

I am currently trying to reproduce your results on the WT2 dataset with FWM. I am using the exact same script with the same PyTorch version (see the attached file). However, I am not able to obtain your results: after 1200 epochs I get 71 valid ppl instead of 66.

How can I reproduce your results, please? I have tried all three seeds (1881, 1882, 1883), but they also do not achieve your reported results.

Thanks!

run_fwm_wt2_1882.out.log (attached)

DavidHerel avatar May 10 '23 09:05 DavidHerel

Hi David, it's been a while since I did this work, but I should be able to help you. First, did you follow the instructions in FWM-README.md? There you will find the exact commands for PTB; the ones for WT2 are almost the same. You can also check the exact hyperparameters in the log file, as I print the args at the beginning of training.

You can find the logs of the WT2 run here: https://github.com/ischlag/Fast-Weight-Memory-public/blob/main/language-modelling/fwm/logs_wt2/run_fwm_wt2_1881.out As you can see, we train for 1600 epochs and switch the optimiser after epoch 1372.

Also, have you seen Appendix H? There we go into more detail. We also tuned/trained the softmax temperature (a scalar that multiplies the logits) on the validation set, which improved the results by an additional ppl point. The logs for that can be found here: https://github.com/ischlag/Fast-Weight-Memory-public/blob/main/language-modelling/fwm/logs_wt2/eval_fwm_wt2_1881.out

At the bottom of the eval_*.out files you will find the ppl that we report in the results table of the paper.
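In case it helps, here is a minimal sketch of what the temperature tuning amounts to. This is illustrative only: the model interface, the grid of scales, and the function names are assumptions, not the exact eval.py.

```python
# Sketch of softmax-temperature tuning (illustrative, not the exact eval.py).
# A scalar multiplies the logits; the scale is picked on the validation set
# and then applied once to the test set.
import math
import torch
import torch.nn.functional as F

def evaluate_ppl(model, data_loader, scale=1.0, device="cpu"):
    """Perplexity with the logits multiplied by `scale` before the softmax."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for inputs, targets in data_loader:          # (batch, seq) token ids
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)                   # (batch, seq, vocab), assumed interface
            loss = F.cross_entropy(
                (logits * scale).flatten(0, 1), targets.flatten(), reduction="sum"
            )
            total_loss += loss.item()
            total_tokens += targets.numel()
    return math.exp(total_loss / total_tokens)

def tune_temperature(model, valid_loader, test_loader, device="cpu"):
    # Grid-search the scale on validation, then report test ppl at the best scale.
    scales = [1.0 + 0.02 * i for i in range(-5, 11)]   # e.g. 0.90 .. 1.20
    best = min(scales, key=lambda s: evaluate_ppl(model, valid_loader, s, device))
    return best, evaluate_ppl(model, test_loader, best, device)
```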

(Edit: In fact, a validation ppl of ~71 after 1,200 epochs is pretty much what we see as well.)

I'd also recommend trying to reproduce the AWD-LSTM and AWD-TXL results using our scripts. We were not able to reproduce the AWD-TXL result ourselves, which is why we include our own results in the paper and the repo.

By the way, despite what the awkward choice of seeds might suggest, these were not cherry-picked. I did all of that manually back then.

Lastly, I would recommend NOT using this code base. If I remember correctly, I reused the TXL or AWD-LSTM codebase, which had already been reused several times. There was also an issue with weight dropping when trying to make it work with a newer PyTorch version.

In case you are interested: because PTB and WT2, but also WikiText-103, are problematic/deprecated, I am currently working on a new benchmark that supports language-modelling research at different compute scales, with clean code and pretrained models. However, it will still take more than a month before it goes public.

ischlag avatar May 11 '23 08:05 ischlag

Thank you for the reply!

I have read and followed the instructions in FWM-README.md and I am using exactly the same commands. I have also attached my log file, which is identical to yours in terms of hyperparameters (checked with a diff checker), but it does not behave the same: the ppl is much higher for all three seeds. I am also using the same PyTorch version.

For seed 1882, which performs best in your results, I obtain a valid ppl of 71.34 at epoch 1200, whereas your log with the same hyperparameters and the same PyTorch version shows a valid ppl of 66.63.

Is there anything else I could try to get these results?

I have not yet tried playing with the temperature, as for now I am trying to train the model first. I agree with you that it is very hard to reproduce the AWD-TXL and AWD-LSTM scripts, as they are deprecated and I had to make a lot of bug fixes.

EDIT: For PTB I am able to get the reported results, but WT2 is problematic.

DavidHerel avatar May 11 '23 19:05 DavidHerel

It looks like your run does not switch to ASGD. If you check the end of my log, at line 5643, we switch to ASGD after epoch 1372 (https://github.com/ischlag/Fast-Weight-Memory-public/blob/64c077f02ec320ec535cb66db3600453e1ef445f/language-modelling/fwm/logs_wt2/run_fwm_wt2_1881.out#LL5643C1-L5643C18). Then we train until epoch 1600.

But in your log you seem to stop after epoch 1200 without the switch to ASGD. That is probably the difference.

Now, I just noticed that the valid ppl doesn't change anymore after switching to ASGD. That is probably a bug: it may not be using the right set of weights for validation after the switch, even though the training ppl keeps decreasing.
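For reference, the AWD-LSTM-style training loop this code descends from typically handles that by temporarily swapping in the averaged ASGD weights for evaluation. Here is a rough sketch of the pattern; it is illustrative and not necessarily what the main.py in this repo does (evaluate_fn and the fixed switch epoch are assumptions):

```python
# Sketch of the AWD-LSTM-style ASGD switch and evaluation with averaged weights
# (torch.optim.ASGD keeps the running average in optimizer.state[p]['ax']).
import torch

def maybe_switch_to_asgd(model, optimizer, epoch, switch_epoch, lr, wdecay):
    # Replace SGD with ASGD at a fixed epoch (the WT2 log switches after epoch 1372).
    if epoch == switch_epoch and isinstance(optimizer, torch.optim.SGD):
        optimizer = torch.optim.ASGD(model.parameters(), lr=lr, t0=0,
                                     lambd=0.0, weight_decay=wdecay)
    return optimizer

def evaluate_with_averaged_weights(model, optimizer, evaluate_fn, val_data):
    # When running ASGD, validate with the averaged parameters ('ax'),
    # then restore the raw parameters before training continues. If this swap
    # is missing, the validation ppl appears frozen after the switch even
    # though the training ppl keeps improving.
    if isinstance(optimizer, torch.optim.ASGD):
        backup = {}
        for p in model.parameters():
            backup[p] = p.data.clone()
            if 'ax' in optimizer.state[p]:
                p.data.copy_(optimizer.state[p]['ax'])
        val_loss = evaluate_fn(model, val_data)
        for p in model.parameters():
            p.data.copy_(backup[p])
    else:
        val_loss = evaluate_fn(model, val_data)
    return val_loss
```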

With all that, please check if you have the same issue with PTB.

ischlag avatar May 12 '23 08:05 ischlag

Thank you for the tips. However, even when I increase the number of epochs to 1600, I still cannot get below a validation loss of 4.26. I am using precisely the same configuration. The log I provided is cut at 1200 epochs only because training further did not result in any change; switching to ASGD after epoch 1372 also did not result in any improvement.
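(For context, if I read the script correctly the valid ppl it prints is exp(valid loss), so a loss of 4.26 corresponds to exp(4.26) ≈ 70.8 ppl, i.e. still roughly the same ~71 ppl as before.)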

With PTB I am able to reach the reported perplexity with the training script. However, changing the temperature with eval.py does not result in any perplexity gain (see eval_fwm_ptb_141.out.txt).

DavidHerel avatar May 18 '23 20:05 DavidHerel

I currently don't have time to run the evaluation myself. I don't remember how effective the temperature tuning was on PTB, but we applied all tricks to both datasets, and there was about a 1 ppl difference if I remember correctly. In the log you can see that the optimum on the validation set is not at a temperature of 1.0. This indicates that it helps, and the validation ppl is indeed lower (although not by much). Unfortunately, that log does not include a test evaluation without the tuning.

Regarding training, please run the full shell command used to reproduce the results, including the optimiser switch and the softmax tuning, and provide the entire log. That should work.

Which eval_fwm_ptb_141.out.txt are you linking, anyway? Is that from a previous commit? The file you linked is not in the log folder of the final commit: https://github.com/ischlag/Fast-Weight-Memory-public/tree/main/language-modelling/fwm/logs_ptb

ischlag avatar May 19 '23 09:05 ischlag

I'm totally perplexed by the log you posted, because it seems to come from the repository, but it does not (?!):

This is the training of seed 141: https://github.com/ischlag/Fast-Weight-Memory-public/blob/main/language-modelling/fwm/logs_ptb/run_fwm_ptb_141.out

which has a final test ppl of 55.13

here is the softmax tuning: https://github.com/ischlag/Fast-Weight-Memory-public/blob/main/language-modelling/fwm/logs_ptb/eval_fwm_ptb_141.out

which reaches its best test ppl of 54.48 by scaling the softmax by 1.08

The log you linked doesn't even test the softmax scale of 1.08 ...

ischlag avatar May 19 '23 09:05 ischlag

I have posted my log of eval.py, which I ran on the trained PTB model. This PTB model achieves the reported test perplexity after training; however, as I wrote, eval.py, i.e. the temperature tuning, does not yield improvements similar to those in your log files.

Regarding the training on WT2, I will run it again and provide the exact script and log file to help reproduce my issue.

DavidHerel avatar May 19 '23 10:05 DavidHerel

Ah. So that is your own log that you uploaded here in this issue, and you didn't actually link a file of mine. Sorry for the confusion.

You didn't provide the training log for the PTB run. My logs indicate that the temperature tuning helps, so I suppose your PTB run is also different.

It could be that the issue is somehow deeper. Can you share with me what your motivation is? These datasets are very small by today's standards.

ischlag avatar May 19 '23 10:05 ischlag