Freeze layers for transfer learning

DanBmh opened this issue 3 years ago • 16 comments

Currently, when doing transfer learning, we reinitialize the uppermost layers randomly. Afterwards, training continues normally. But this has the problem that gradients are also propagated through the lower layers we would like to keep, which changes them too.

These changes allow training only the reinitialized uppermost layers. Afterwards, you can start a new training run that further optimizes the complete network.
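
To illustrate the mechanism, here is a minimal toy sketch in TF1-style Python (not the actual change in this PR; the tiny dense model and the scope names `layer_1`/`layer_2` are made-up stand-ins for the transferred lower layers and the reinitialized output layer): only the variables handed to the optimizer's `var_list` receive gradient updates, so the lower layers stay untouched in step one.

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Toy stand-in for the acoustic model: 'layer_1' plays the role of the
# transferred lower layers, 'layer_2' the reinitialized output layer.
x = tf.placeholder(tf.float32, [None, 16])
y = tf.placeholder(tf.float32, [None, 4])

with tf.variable_scope("layer_1"):
    h = tf.layers.dense(x, 32, activation=tf.nn.relu)
with tf.variable_scope("layer_2"):
    logits = tf.layers.dense(h, 4)

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))

def vars_to_train(scope=None):
    """Step 1: only the top layer's variables; step 2 (scope=None): all of them."""
    return tf.trainable_variables(scope=scope)

opt = tf.train.AdamOptimizer(1e-3)
train_frozen = opt.minimize(loss, var_list=vars_to_train("layer_2"))  # step 1
train_full = opt.minimize(loss, var_list=vars_to_train())             # step 2

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed = {x: np.random.rand(8, 16),
            y: np.eye(4)[np.random.randint(0, 4, 8)]}
    sess.run(train_frozen, feed_dict=feed)  # gradients never touch layer_1
    sess.run(train_full, feed_dict=feed)    # later run: optimize everything
```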

DanBmh avatar Aug 13 '20 12:08 DanBmh

@DanBmh do you have any indication that this works better than tuning the whole network? In our experiments (see earlier versions of the Common Voice paper and Josh's thesis, Chapter 8) we found that tuning all layers works quite a bit better than freezing any of them, see e.g. this figure: [screenshot: Captura de 2020-08-13 15-12-46]

ftyers avatar Aug 13 '20 14:08 ftyers

@ftyers I'm working on it right now :)

But the approach I'm suggesting is a bit different from yours: it uses both steps.

My transfer-learning workflow would look like this:

  1. training with frozen layers
  2. training with all layers (you have to start a new training for this)

Just a side note on the topic: I don't reinitialize the last layer if possible, because in my experiments with German this gave better results than random initialization of the last layer.

DanBmh avatar Aug 13 '20 14:08 DanBmh

@DanBmh So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?

I'd be interested in seeing the results when you have them!

ftyers avatar Aug 13 '20 15:08 ftyers

> So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?

Exactly


The first test was not as good as planned (using my other PR #3245; es_epochs=7; reduce_lr_plateau_epochs=3; es_min_delta=0.9; no augmentation):

| Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
| --- | --- | --- | --- | --- |
| DE | Voxforge | frozen transfer-learning, then training all layers | Test: 37.707958, Validation: 41.832220 | 12+3; Time: 42min |
| DE | Voxforge | without frozen transfer-learning | Test: 36.630890, Validation: 41.208125 | 7; Time: 28min |

Not sure why; maybe some training randomness, because I don't think this approach should lead to worse results.

DanBmh avatar Aug 13 '20 15:08 DanBmh

I ran another test, this time trying your approach of dropping and reinitializing the last layer. (As noted above, I normally don't drop the layer when training German; I just train over the English weights.)

| Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
| --- | --- | --- | --- | --- |
| DE | Voxforge | dropped last layer | Test: 42.516270, Validation: 47.105518 | 8; Time: 28min |
| DE | Voxforge | with frozen transfer-learning in two steps | Test: 36.600590, Validation: 40.640134 | 14+8; Time: 42min |

Here you can see an improvement when using the frozen transfer-learning approach. (A note on the dataset: Voxforge has 31h; I'm using about 5h each for dev and test and the rest for training, so it's quite small.)


So I would say that if the network architecture did not change, it's faster to train directly with the English weights (no dropping, no freezing), but if the network had to be changed (different alphabet), it's better to train in two steps with the frozen layers.

Of course we would need to do some more tests before we can say this for sure.

DanBmh avatar Aug 14 '20 10:08 DanBmh

Not being judgmental, but I think we'll at least wait until after 1.0 to merge that

lissyx avatar Aug 14 '20 10:08 lissyx

@DanBmh -- even though my previous research with DeepSpeech seems to point to frozen transfer not working, I still think this feature should be integrated. Your two-step approach makes perfect intuitive sense, and there's work from computer vision and NLP showing that frozen transfer works very well.

So, I think the feature would be useful, but @lissyx is right, this is a post-1.0 feature.

JRMeyer avatar Aug 14 '20 12:08 JRMeyer

@DanBmh -- I don't see why you need a new flag for `load_frozen_graph`. For reference, this is how I implemented transfer-learning + freezing layers before: https://github.com/mozilla/STT/blob/transfer-learning2/DeepSpeech.py#L264-L278

JRMeyer avatar Aug 14 '20 12:08 JRMeyer

> I don't see why you need a new flag for `load_frozen_graph`

The problem is that when loading the checkpoint for the second training, the frozen layers have no variables for Adam: they were not saved in the first training because they were not used, so we have to reinitialize them.
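
To make the failure mode concrete, here is a hedged sketch (the helper name is made up, not code from this PR) of checking which Adam slot tensors a stage-1 checkpoint actually contains; the frozen layers have no `<name>/Adam` or `<name>/Adam_1` entries because the optimizer never created them, so a plain restore of all variables in stage 2 would fail:

```python
import tensorflow.compat.v1 as tf

def adam_slots_missing_from(checkpoint_dir):
    """Return graph variables whose Adam slots are absent from the checkpoint."""
    ckpt = tf.train.latest_checkpoint(checkpoint_dir)
    saved_names = {name for name, _shape in tf.train.list_variables(ckpt)}
    return [v for v in tf.global_variables()
            if "/Adam" in v.op.name and v.op.name not in saved_names]
```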

DanBmh avatar Aug 14 '20 12:08 DanBmh

  1. Currently I use this for my German training (0 dropped, 1 frozen). I think this makes sense for all languages with the same alphabet as English and a similar pronunciation. It could also be used for transfer learning of dialects: instead of randomly initialized weights you would get somewhat matching weights at the beginning of the training.

  2. You're right about this one, and I think reinitialization after training is a good idea. I hope I can find some time in the next few days to test it.

DanBmh avatar Aug 18 '20 18:08 DanBmh

@DanBmh Please don't let that sink, if you have some time to rebase :)

lissyx avatar Aug 28 '20 16:08 lissyx

> Not being judgmental, but I think we'll at least wait until after 1.0 to merge that

We might change our opinion here, right, @reuben?

lissyx avatar Aug 28 '20 16:08 lissyx

@JRMeyer what do you think about reinitializing tensors named "Adam" by default if they are missing, with an additional message telling users that they were reinitialized because they are not in the checkpoint?

At the moment I can't think of an elegant way to reinitialize the Adam tensors before checkpoint saving, because we would need to reinitialize them every time we save a checkpoint (some might want to load intermediate checkpoints for some reason).
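
A sketch of what such a load-time default could look like (illustrative only, not the PR's implementation; the function name is made up): restore everything the checkpoint contains, then run the initializers of any "Adam" tensors it lacks and tell the user why.

```python
import tensorflow.compat.v1 as tf

def restore_and_reinit_adam(session, checkpoint_dir):
    """Restore saved variables; reinitialize Adam tensors missing from the checkpoint."""
    ckpt = tf.train.latest_checkpoint(checkpoint_dir)
    saved_names = {name for name, _shape in tf.train.list_variables(ckpt)}

    to_restore = [v for v in tf.global_variables() if v.op.name in saved_names]
    to_reinit = [v for v in tf.global_variables()
                 if v.op.name not in saved_names and "Adam" in v.op.name]

    tf.train.Saver(var_list=to_restore).restore(session, ckpt)
    if to_reinit:
        print("Reinitializing {} Adam tensors missing from the checkpoint "
              "(not saved because they were not trained in the frozen run)."
              .format(len(to_reinit)))
        session.run(tf.variables_initializer(to_reinit))
```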

DanBmh avatar Aug 29 '20 13:08 DanBmh

> Not being judgmental, but I think we'll at least wait until after 1.0 to merge that
>
> We might change our opinion here, right, @reuben?

Yeah, I think it should be fine to land this once the comments here have been addressed. It's also a good opportunity to make the training functionality more fine-grained instead of the huge train.py, which can do a billion different things depending on the combination of flags that's passed. It's really hard to reason about what a training call is going to do unless you're deeply familiar with the code.

reuben avatar Aug 29 '20 21:08 reuben

@JRMeyer ping

reuben avatar Oct 19 '20 11:10 reuben