DeepSpeech
Freeze layers for transfer learning
Currently, when doing transfer learning we reinitialize the uppermost layers randomly, and afterwards training continues normally. But this has the problem that gradients are also propagated through the lower layers we would like to keep, changing them too.
These changes allow training only the reinitialized uppermost layers. Afterwards you can start a new training that further optimizes the complete network.
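Roughly, the idea is to hand the optimizer only the variables it is allowed to update (in TensorFlow, the `var_list` argument of `Optimizer.minimize`). A minimal name-filtering sketch; the layer names below are illustrative, not the exact DeepSpeech graph names:

```python
# Sketch: freeze all layers except the uppermost one by selecting which
# variables the optimizer may update. Layer names are illustrative.
ALL_VARS = [
    "layer_1/weights", "layer_1/bias",
    "layer_2/weights", "layer_2/bias",
    "layer_3/weights", "layer_3/bias",
    "lstm/kernel",     "lstm/bias",
    "layer_5/weights", "layer_5/bias",
    "layer_6/weights", "layer_6/bias",   # output layer
]

def trainable_vars(all_vars, frozen_prefixes):
    """Variables the optimizer may update: everything whose name does
    not start with one of the frozen layer prefixes."""
    return [v for v in all_vars
            if not v.startswith(tuple(frozen_prefixes))]

# Step 1: train only the (reinitialized) output layer.
step1 = trainable_vars(ALL_VARS, ["layer_1", "layer_2", "layer_3",
                                  "lstm", "layer_5"])
# Step 2: a fresh training run that fine-tunes the whole network.
step2 = trainable_vars(ALL_VARS, [])
```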
No Taskcluster jobs started for this pull request
The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the
default branch) does not allow starting tasks for this pull request.
@DanBmh do you have any indication that this works better than tuning the whole network? In our experiments (see earlier versions of the Common Voice paper and Chapter 8 of Josh's thesis) we found that tuning all layers works quite a bit better than freezing any of them, see e.g. this figure:
@ftyers I'm working on it right now :)
But the approach I'm suggesting is a bit different from yours. It uses both steps.
My transfer-learning workflow would look like this:
- training with frozen layers
- training with all layers (you have to start a new training for this)
Just a side note on the topic: I'm not reinitializing the last layer if possible, because in my experiments with German I got better results than with random initialization of the last layer.
@DanBmh So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?
I'd be interested in seeing the results when you have them!
So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?
Exactly
First test was not as good as planned: (using my other pr #3245; es_epochs=7; reduce_lr_plateau_epochs=3; es_min_delta=0.9; no augmentation)
Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
---|---|---|---|---|
DE | Voxforge | frozen transfer-learning, then training all layers | Test: 37.707958, Validation: 41.832220 | 12+3; Time: 42min |
DE | Voxforge | without frozen transfer-learning | Test: 36.630890, Validation: 41.208125 | 7; Time: 28min |
Not sure why; maybe some training randomness, because I don't think this should lead to worse results.
I did run another test, this time I tried your approach with dropping and reinitializing the last layer. (As noted above, I normally don't drop the layer when training German, I just train over the English weights)
Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
---|---|---|---|---|
DE | Voxforge | dropped last layer | Test: 42.516270, Validation: 47.105518 | 8; Time: 28min |
DE | Voxforge | with frozen transfer-learning in two steps | Test: 36.600590, Validation: 40.640134 | 14+8; Time: 42min |
Here you can see an improvement when using the frozen transfer-learning approach. (Note on the dataset: Voxforge has 31h; I'm using about 5h each for dev+test, the rest for training. So it's quite small.)
So I would say: if the network architecture did not change, it's faster to train with the English weights directly (no dropping, no freezing), but if the network had to be changed (different alphabet), it's better to train in two steps with the frozen network.
Of course we would need to do some more tests before we can say this for sure.
Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.
@DanBmh -- even though my previous research with deepspeech seems to point to frozen-transfer not working, I still think this feature should be integrated. Your two-step approach makes perfect intuitive sense, and there's work from computer vision and NLP that shows frozen transfer works very well.
So, I think the feature would be useful, but @lissyx is right, this is a post-1.0 feature.
@DanBmh -- I don't see why you need a new flag for `load_frozen_graph`. For reference, this is how I implemented transfer-learning + freezing layers before: https://github.com/mozilla/STT/blob/transfer-learning2/DeepSpeech.py#L264-L278
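For context, the linked code's approach (as I read it) is to drop the N uppermost layers by simply excluding them from the restore list, so they keep their random initialization. A rough sketch; the layer names and the count-from-the-top convention are assumptions, not the actual implementation:

```python
# Layers of the source model in forward order (illustrative names).
LAYERS = ["layer_1", "layer_2", "layer_3", "lstm", "layer_5", "layer_6"]

def layers_to_restore(drop_source_layers):
    """Restore all source-model layers except the N uppermost ones,
    which are left randomly initialized (i.e. "dropped")."""
    keep = len(LAYERS) - drop_source_layers
    return LAYERS[:keep]

# Dropping only the output layer, e.g. for a new alphabet:
restored = layers_to_restore(1)
```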
I don't see why you need a new flag for `load_frozen_graph`
Problem is that when loading the checkpoint for the second training, the frozen layers have no variables for Adam. They were not loaded/saved in the first training because they were not used, so we have to reinitialize them.
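In code, the loading logic this implies could look roughly like the following: compare the graph's variables against what the checkpoint actually contains, restore the overlap, and reinitialize the rest (the Adam slot variables of the previously frozen layers). Variable names here are illustrative, not the real DeepSpeech checkpoint contents:

```python
def split_restore_reinit(graph_vars, ckpt_vars):
    """Partition graph variables into those restorable from the
    checkpoint and those that must be freshly initialized, e.g.
    Adam slots that were never created for frozen layers."""
    ckpt = set(ckpt_vars)
    restore = [v for v in graph_vars if v in ckpt]
    reinit = [v for v in graph_vars if v not in ckpt]
    return restore, reinit

# Checkpoint from the frozen first run: layer_1 was frozen, so its
# Adam moment tensors were never created and thus never saved.
graph_vars = [
    "layer_1/weights", "layer_1/weights/Adam", "layer_1/weights/Adam_1",
    "layer_6/weights", "layer_6/weights/Adam", "layer_6/weights/Adam_1",
]
ckpt_vars = [
    "layer_1/weights",
    "layer_6/weights", "layer_6/weights/Adam", "layer_6/weights/Adam_1",
]
restore, reinit = split_restore_reinit(graph_vars, ckpt_vars)
```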
-
Currently I use this for my German training (0 dropped, 1 frozen). I think this makes sense for all languages with the same alphabet as English and similar pronunciation. It could also be used for transfer learning of dialects: instead of randomly initialized weights you would get somewhat matching weights at the beginning of the training.
-
You're right about this one, and I think reinitialization after training is a good idea. I hope I can find some time in the next few days to test it.
@DanBmh Please don't let that sink, if you have some time to rebase :)
Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.
We might change our opinion here, right @reuben ?
@JRMeyer what do you think about reinitializing tensors named "Adam" by default if they are missing, with an additional message telling users they were reinitialized because they were not in the checkpoint?
I can't think of an elegant way to reinitialize the Adam tensors before checkpoint saving at the moment, because we would need to reinitialize them every time before we save a checkpoint (some might want to load intermediate checkpoints for some reason).
Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.
We might change our opinion here, right @reuben ?
Yeah, I think it should be fine to land this once the comments here have been addressed. It's also a good opportunity to make the training functionality more fine-grained, instead of the huge train.py which can do a billion different things depending on the combination of flags that's passed. It's really hard to reason about what a training call is going to do unless you're deeply familiar with the code.
@JRMeyer ping