DeepSpeech
Freeze layers for transfer learning
Currently, when doing transfer learning we reinitialize the uppermost layers randomly, and afterwards training continues normally. But this has the problem that gradients are also propagated through the lower layers we would like to keep, changing them too.
These changes allow training only the reinitialized uppermost layers. Afterwards you can start a new training that further optimizes the complete network.
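Roughly, the idea is to hand the optimizer only the variables it is allowed to update (in TensorFlow, the `var_list` argument of `Optimizer.minimize`). A minimal name-filtering sketch; the layer names below are illustrative, not the exact DeepSpeech graph names:

```python
# Sketch: freeze all layers except the uppermost one by selecting which
# variables the optimizer may update. Layer names are illustrative.
ALL_VARS = [
    "layer_1/weights", "layer_1/bias",
    "layer_2/weights", "layer_2/bias",
    "layer_3/weights", "layer_3/bias",
    "lstm/kernel",     "lstm/bias",
    "layer_5/weights", "layer_5/bias",
    "layer_6/weights", "layer_6/bias",   # output layer
]

def trainable_vars(all_vars, frozen_prefixes):
    """Variables the optimizer may update: everything whose name does
    not start with one of the frozen layer prefixes."""
    return [v for v in all_vars
            if not v.startswith(tuple(frozen_prefixes))]

# Step 1: train only the (reinitialized) output layer.
step1 = trainable_vars(ALL_VARS, ["layer_1", "layer_2", "layer_3",
                                  "lstm", "layer_5"])
# Step 2: a fresh training run that fine-tunes the whole network.
step2 = trainable_vars(ALL_VARS, [])
```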
No Taskcluster jobs started for this pull request
The `allowPullRequests` configuration for this repository (in `.taskcluster.yml` on the
default branch) does not allow starting tasks for this pull request.
@DanBmh do you have any indication that this works better than tuning the whole network? In our experiments (see earlier versions of the Common Voice paper and Chapter 8 of Josh's thesis) we found that tuning all layers works quite a bit better than freezing any of them, see e.g. this figure:
@ftyers I'm working on it right now :)
But the approach I'm suggesting is a bit different from yours. It uses both steps.
My transfer-learning workflow would look like this:
- training with frozen layers
- training with all layers (you have to start a new training for this)
Just a side note on the topic: I'm not reinitializing the last layer if possible, because in my experiments with German I got better results than with random initialization of the last layer.
@DanBmh So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?
I'd be interested in seeing the results when you have them!
So, you freeze all layers apart from the last one, and then train the last layer. Then when that has trained, you train all the layers?
Exactly
First test was not as good as planned: (using my other pr #3245; es_epochs=7; reduce_lr_plateau_epochs=3; es_min_delta=0.9; no augmentation)
Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
---|---|---|---|---|
DE | Voxforge | frozen transfer-learning, then training all layers | Test: 37.707958, Validation: 41.832220 | 12+3; Time: 42min |
DE | Voxforge | without frozen transfer-learning | Test: 36.630890, Validation: 41.208125 | 7; Time: 28min |
Not sure why; maybe some training randomness, because I don't think this should lead to worse results.
I did run another test, this time I tried your approach with dropping and reinitializing the last layer. (As noted above, I normally don't drop the layer when training German, I just train over the English weights)
Language | Dataset | Additional Infos | Losses | Training epochs of best model / Total duration |
---|---|---|---|---|
DE | Voxforge | dropped last layer | Test: 42.516270, Validation: 47.105518 | 8; Time: 28min |
DE | Voxforge | with frozen transfer-learning in two steps | Test: 36.600590, Validation: 40.640134 | 14+8; Time: 42min |
Here you can see an improvement when using the frozen transfer-learning approach. (Note on the dataset: Voxforge has 31h; I'm using about 5h each for dev+test, the rest for training. So it's quite small.)
So I would say: if the network architecture did not change, it's faster to train with the English weights directly (no dropping, no freezing), but if the network had to be changed (different alphabet), it's better to train in two steps with the frozen network.
Of course we would need to do some more tests before we can say this for sure.
Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.
@DanBmh -- even though my previous research with deepspeech seems to point to frozen-transfer not working, I still think this feature should be integrated. Your two-step approach makes perfect intuitive sense, and there's work from computer vision and NLP that shows frozen transfer works very well.
So, I think the feature would be useful, but @lissyx is right, this is a post-1.0 feature.
@DanBmh -- I don't see why you need a new flag for `load_frozen_graph`. For reference, this is how I implemented transfer-learning + freezing layers before: https://github.com/mozilla/STT/blob/transfer-learning2/DeepSpeech.py#L264-L278
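For context, the linked code's approach (as I read it) is to drop the N uppermost layers by simply excluding them from the restore list, so they keep their random initialization. A rough sketch; the layer names and the count-from-the-top convention are assumptions, not the actual implementation:

```python
# Layers of the source model in forward order (illustrative names).
LAYERS = ["layer_1", "layer_2", "layer_3", "lstm", "layer_5", "layer_6"]

def layers_to_restore(drop_source_layers):
    """Restore all source-model layers except the N uppermost ones,
    which are left randomly initialized (i.e. "dropped")."""
    keep = len(LAYERS) - drop_source_layers
    return LAYERS[:keep]

# Dropping only the output layer, e.g. for a new alphabet:
restored = layers_to_restore(1)
```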
I don't see why you need a new flag for `load_frozen_graph`
Problem is that when loading the checkpoint for the second training, the frozen layers have no variables for Adam. They were not loaded/saved in the first training because they were not used, so we have to reinitialize them.
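In code, the loading logic this implies could look roughly like the following: compare the graph's variables against what the checkpoint actually contains, restore the overlap, and reinitialize the rest (the Adam slot variables of the previously frozen layers). Variable names here are illustrative, not the real DeepSpeech checkpoint contents:

```python
def split_restore_reinit(graph_vars, ckpt_vars):
    """Partition graph variables into those restorable from the
    checkpoint and those that must be freshly initialized, e.g.
    Adam slots that were never created for frozen layers."""
    ckpt = set(ckpt_vars)
    restore = [v for v in graph_vars if v in ckpt]
    reinit = [v for v in graph_vars if v not in ckpt]
    return restore, reinit

# Checkpoint from the frozen first run: layer_1 was frozen, so its
# Adam moment tensors were never created and thus never saved.
graph_vars = [
    "layer_1/weights", "layer_1/weights/Adam", "layer_1/weights/Adam_1",
    "layer_6/weights", "layer_6/weights/Adam", "layer_6/weights/Adam_1",
]
ckpt_vars = [
    "layer_1/weights",
    "layer_6/weights", "layer_6/weights/Adam", "layer_6/weights/Adam_1",
]
restore, reinit = split_restore_reinit(graph_vars, ckpt_vars)
```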
-
Currently I use this for my German training (0 dropped, 1 frozen). I think this makes sense for all languages with the same alphabet as English and similar pronunciation. It could also be used for transfer learning of dialects: instead of randomly initialized weights you would get somewhat matching weights at the beginning of the training.
-
You're right about this one, and I think reinitialization after training is a good idea. I hope I can find some time in the next few days to test it.
@DanBmh Please don't let that sink, if you have some time to rebase :)
Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.
We might change our opinion here, right @reuben ?
@JRMeyer what do you think about reinitializing tensors named "Adam" by default if they are missing, with an additional message telling users they were reinitialized because they were not in the checkpoint?
I can't think of an elegant way to reinitialize the Adam tensors before checkpoint saving at the moment, because we would need to reinitialize them every time before we save a checkpoint (some might want to load intermediate checkpoints for some reason).
Not being judgmental, but I think we'll at least wait until after 1.0 to merge that.
We might change our opinion here, right @reuben ?
Yeah, I think it should be fine to land this once the comments here have been addressed. It's also a good opportunity to make the training functionality more fine-grained, instead of the huge train.py which can do a billion different things depending on the combination of flags that's passed. It's really hard to reason about what a training call is going to do unless you're deeply familiar with the code.
@JRMeyer ping