
Machine Translation Save and Load model problem

tzzcl opened this issue 9 years ago · 10 comments

Hello,

While playing with the machine translation example, I came across an unexpected problem: I can't reload the previously trained model, and I see a warning.

Looking at the code in checkpoints.py, at line 138:

        params_all = self.load_parameters()
        params_this = main_loop.model.get_parameter_dict()

I printed the contents of params_all and params_this.

I found that the parameter names in params_all contain many '-' characters, while the names in params_this contain only '/', so loading the parameters fails.

Can anyone fix this problem?

Thanks in advance

tzzcl avatar Mar 12 '16 05:03 tzzcl

+1. I simply replaced '-' back with '/'. That works, but the perplexity seems to be wrong after resuming training. I guess there may be something wrong with the state-loading part.
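
For anyone else hitting this, a minimal sketch of that rename workaround (the helper name and the commented usage below are illustrative, not the actual code in checkpoints.py):

    # Hypothetical helper: saved parameter names use '-', while
    # main_loop.model.get_parameter_dict() uses '/'-separated names,
    # so remap the keys before assigning values.
    def remap_parameter_names(params_all):
        """Return a copy of the loaded parameter dict with '-' turned back into '/'."""
        return {name.replace('-', '/'): value
                for name, value in params_all.items()}

    # params_all = self.load_parameters()                  # keys with '-'
    # params_this = main_loop.model.get_parameter_dict()   # keys with '/'
    # for name, value in remap_parameter_names(params_all).items():
    #     if name in params_this:
    #         params_this[name].set_value(value)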

magic282 avatar Apr 21 '16 14:04 magic282

Sounds like this is due to https://groups.google.com/forum/#!searchin/blocks-users/Serialization/blocks-users/1wy205t-q4M/jk_5g-GqBgAJ

rizar avatar Apr 27 '16 15:04 rizar

@rizar I tried replacing the checkpoint code with the latest code in saveload.py. The model can be loaded, but it seems the state or something is messed up.

INFO:blocks.algorithms:Initializing the training algorithm
INFO:blocks.algorithms:The training algorithm is initialized
TRAINING HAS BEEN RESUMED
loaded_from: models/wsj_save_test\wsj_checkpoint.100

wsj_checkpoint 100 perplexity: 8.91602993011
wsj_checkpoint 101 perplexity: 9.54819965363
wsj_checkpoint 102 perplexity: 68.5513381958
wsj_checkpoint 103 perplexity: 17.3546619415
wsj_checkpoint 104 perplexity: 19.3960876465
wsj_checkpoint 105 perplexity: 13.9947977066
wsj_checkpoint 106 perplexity: 10.2129096985
wsj_checkpoint 107 perplexity: 9.65995502472
wsj_checkpoint 108 perplexity: 9.79283046722
wsj_checkpoint 109 perplexity: 13.2218437195
wsj_checkpoint 110 perplexity: 9.98828601837
wsj_checkpoint 111 perplexity: 9.75242424011
wsj_checkpoint 112 perplexity: 9.56409931183
wsj_checkpoint 113 perplexity: 8.89967918396
wsj_checkpoint 114 perplexity: 9.52484989166
wsj_checkpoint 115 perplexity: 8.67778491974
wsj_checkpoint 116 perplexity: 9.47644615173
wsj_checkpoint 117 perplexity: 8.71446323395
wsj_checkpoint 118 perplexity: 9.38176059723
wsj_checkpoint 119 perplexity: 8.67112255096

Before iteration 100, the perplexity had already dropped to about 9. After resuming, iteration 100 looks correct, but the perplexity then increases significantly before dropping back. I think this may indicate that something went wrong. Is there anything I missed? Thanks.

magic282 avatar Apr 28 '16 12:04 magic282

This might be related to the step rule accumulators.

orhanf avatar Apr 28 '16 14:04 orhanf

@orhanf I guess so. I was using AdaGrad. So will the dump contain the adaptive algorithms' accumulators?

magic282 avatar Apr 29 '16 02:04 magic282

@Thrandis might have an answer to that

orhanf avatar Apr 29 '16 03:04 orhanf

Unfortunately, I have no idea. If the model was dumped with the old serialization, it is likely that it won't load properly with the new serialization mechanism, so you should try loading it with an older version of the code.

If it still doesn't work with the old version of the code, then there is a bug in the old serialization code, which means that you won't be able to load your model anymore.

You could still try to load the model, save just the parameters, and create a new model with those parameters, but that is certainly not ideal.
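
For what it's worth, a rough sketch of that fallback, assuming the old checkpoint can still be unpickled (the file paths are placeholders, and the exact loading call depends on which Blocks version wrote the checkpoint):

    import pickle
    from blocks.serialization import load

    # Unpickle the old main loop just to get at its parameter values.
    with open('old_checkpoint.pkl', 'rb') as src:   # placeholder path
        old_main_loop = load(src)
    param_values = old_main_loop.model.get_parameter_values()

    # Stash the raw numpy arrays so the old pickle is no longer needed.
    with open('parameters.pkl', 'wb') as dst:
        pickle.dump(param_values, dst)

    # Later: rebuild the model with the same architecture as during training
    # and inject the saved values into it, e.g.:
    # new_model = Model(cost)
    # with open('parameters.pkl', 'rb') as src:
    #     new_model.set_parameter_values(pickle.load(src))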

Thrandis avatar Apr 29 '16 19:04 Thrandis

As far as I remember, the custom serialization in the MT example does not save the accumulators. Both the new and the old serialization in Blocks do save them, but only by pickling them with the rest of the main loop.

If your perplexity goes back to normal, then I think everything is fine.

rizar avatar Apr 29 '16 21:04 rizar

@Thrandis I tried retraining with the blocks 0.1.1 and 0.2.0 releases and found that both of them have this problem. (I didn't load a model saved by other blocks code.)

@rizar I just replaced LoadNMT with Load and Checkpoint. I think the code reloads the parameters correctly with some minor problems (no idea what happened), and the perplexity goes back to normal.

So is this a problem with Load? I found this in the Load extension:

    if self.load_iteration_state:
        main_loop.iteration_state = \
            loaded_main_loop.iteration_state

and this property:

    @property
    def iteration_state(self):
        """Quick access to the (data stream, epoch iterator) pair."""
        return (self.data_stream, self.epoch_iterator)

So if the new serialization dumps everything, is it correct for me to load them this way? Thank you.
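
A minimal sketch of that setup with the stock extensions from blocks.extensions.saveload (model, algorithm and train_stream are placeholders for whatever the MT example already builds, and keyword names may differ slightly between Blocks releases):

    from blocks.main_loop import MainLoop
    from blocks.extensions.saveload import Checkpoint, Load

    extensions = [
        # Pickle the main loop every 500 batches; pickling the whole loop
        # is what carries the iteration state along with the parameters.
        Checkpoint('checkpoint.tar', every_n_batches=500),
        # On restart, read the file back, including iteration state and log.
        Load('checkpoint.tar', load_iteration_state=True, load_log=True),
    ]

    main_loop = MainLoop(algorithm=algorithm, data_stream=train_stream,
                         model=model, extensions=extensions)
    main_loop.run()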

magic282 avatar Apr 30 '16 07:04 magic282

Sorry for the late reply.

It is not a problem of a particular extension. It is not possible to keep the accumulators of AdaDelta when you save and load your model, unless you pickle the whole main loop. That, on the other hand, can be slow and have other undesirable side effects. That's why people often just save the parameters and the log, e.g. the machine translation example only saves this information. With the accumulators initialized to zero at the beginning of training, it is not surprising that you see some fluctuation of the log-likelihood.
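
As a toy illustration of where that state lives (everything here is made up for the example, and the parameters keyword was called params in older Blocks releases):

    import numpy
    import theano
    from blocks.algorithms import GradientDescent, AdaGrad

    # A single shared "weight" and a trivial quadratic cost, just to show
    # that the AdaGrad accumulator belongs to the algorithm, not the model.
    w = theano.shared(numpy.float64(0.0), name='w')
    cost = (w - 1) ** 2

    algorithm = GradientDescent(cost=cost, parameters=[w],
                                step_rule=AdaGrad(learning_rate=0.1))

    # The squared-gradient accumulator is a shared variable created inside
    # the step rule, so a parameters-only save (as in the MT example) never
    # writes it out; it restarts from zero after loading, which is consistent
    # with the temporary jump in perplexity reported above.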

rizar avatar May 23 '16 23:05 rizar