
Implementation of CTC in pure theano with custom gradient

Open nlgranger opened this issue 8 years ago • 9 comments

Unfortunately, this comes a bit late as theano has recently merged a PR adding some bindings to warp-ctc (https://github.com/Theano/Theano/pull/5949). But I wanted to finish this anyway, so here it is :-).

This implementation:

  • is written in pure theano
  • uses an overridden gradient computation, which is more resilient to precision issues
  • is fairly compact (suggestions for improvements and readability are welcome)
  • works in log space for the most part to prevent precision issues (so does warp-ctc). Note that I haven't used the rescaling trick, though (I don't know if warp-ctc uses it).

I think it can still be useful to anyone who wants to modify the original cost function. And it can run without extra dependencies on any platform where theano already runs.

Notes:

  • I haven't battle-tested the code, just run tests so far. It seems to give results very close to warp-ctc, as it should (differences on the order of 10^-7 on the gradients).
  • The code uses OpFromGraph, which is relatively recent in the theano codebase (a toy sketch of the pattern follows these notes).
  • I have no demo so far; contributions are welcome for that. I think the test script is a poor substitute for a real demo.
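
As a rough illustration of the idea (not the actual CTC graph), the pattern looks roughly like this, assuming a theano version whose OpFromGraph accepts grad_overrides:

```python
# Toy sketch: wrap a graph in OpFromGraph and override its gradient with a
# hand-derived expression (softplus stands in for the CTC loss here).
import numpy as np
import theano
import theano.tensor as T

x = T.vector('x')
naive_softplus = T.log(1 + T.exp(x))          # forward graph, written naively

def softplus_grad(inputs, output_grads):
    # hand-written gradient: d/dx log(1 + exp(x)) = sigmoid(x)
    x_, = inputs
    g, = output_grads
    return [g * T.nnet.sigmoid(x_)]

softplus_op = theano.OpFromGraph([x], [naive_softplus],
                                 grad_overrides=softplus_grad)

y = softplus_op(x).sum()
dy_dx = T.grad(y, x)                          # uses the overridden gradient
f = theano.function([x], [y, dy_dx])
print(f(np.array([-3., 0., 3.], dtype=theano.config.floatX)))
```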

nlgranger avatar Aug 26 '17 21:08 nlgranger

Sorry for the late reply, and thanks for the heads up on the mailing list. Looks cool at first glance! Not quite sure if this belongs in papers or examples... adding a replication of a result from the original paper would place it in the former category; a toy example would probably place it in the latter. Is there a toy example you could come up with, just to demonstrate how to use it, or would you rather just see it merged the way it is?

f0k avatar Nov 29 '17 17:11 f0k

The model from the paper and the data pre-processing part are not overly complicated at first sight, but the prediction algorithm (prefix search) might require some work. I'll try to look into it this weekend.

nlgranger avatar Nov 30 '17 10:11 nlgranger

but the prediction algorithm (prefix search) might require some work

What about a toy example that uses a less complex prediction method in the end (e.g., just sampling)?

f0k avatar Dec 01 '17 16:12 f0k

It seems there are some precision issues on real world data (TIMIT speech). I need to investigate that first. When I get it to work reliably I think I will run the model with a simple prediction scheme (greedy) for the demo.
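
For the record, greedy (best-path) decoding is simple enough to sketch in a few lines of numpy; the shape and blank-index conventions below are just assumptions for the sketch, not necessarily what the recipe uses:

```python
import numpy as np

def greedy_decode(posteriors, blank):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = np.argmax(posteriors, axis=1)           # most likely label per frame
    previous = np.concatenate(([-1], best[:-1]))
    collapsed = best[best != previous]             # collapse consecutive repeats
    return collapsed[collapsed != blank].tolist()  # remove the blank label

# e.g. greedy_decode(softmax_outputs, blank=num_classes - 1)
```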

nlgranger avatar Dec 06 '17 14:12 nlgranger

The latest commit should fix most precision issues, but there is still some divergence: at some point during training, the final output layer values (before softmax) explode. This happens before any useful output is obtained; the network just learns to predict the blank class all the time.

I have added a Tensorflow implementation for the sake of comparison, along with a test notebook that compares the TF implementation of CTC with mine. The loss values of my implementation seem correct, but the gradients are a bit off. I was not able to track down the reason any further.

If anyone is interested in getting CTC in pure Theano, some help would be very welcome ;-)

nlgranger avatar Jan 12 '18 09:01 nlgranger

This happens before any useful output is obtained, the network just learns to predict the blank class all the time.

This (predicting blanks) seems to be a common effect:

  • https://github.com/amaas/stanford-ctc/issues/3
  • https://groups.google.com/forum/#!topic/keras-users/pEKdCYcWLss
  • https://www.reddit.com/r/MachineLearning/comments/47dilt/having_issues_with_speech_recognition_using_ctc/

So maybe this is a good sign ;)

I have added a Tensorflow implementation for the sake of the comparison and a test notebook to compare the TF implementation of CTC with mine.

And the TF implementation works well with the same dataset?

PS: Looking at your notebook, when you call pickle.dump(), you should pass -1 as the third (protocol) argument. This will result in smaller files and much shorter dumping and loading times.
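
For concreteness, something like this (the filename and payload are just placeholders):

```python
import pickle
import numpy as np

data = {'features': np.zeros((1000, 40), dtype='float32')}   # placeholder payload

with open('features.pkl', 'wb') as f:
    pickle.dump(data, f, -1)   # -1 selects the highest available pickle protocol

with open('features.pkl', 'rb') as f:
    data = pickle.load(f)
```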

f0k avatar Jan 23 '18 10:01 f0k

Ok, the damn error is fixed now: both the loss and the gradient are in line with tensorflow's implementation.

Thanks for reading through this; I have corrected the pickle line. I will let training run for a long time to see whether it moves past predicting blanks all the time, since that is the expected behaviour early on.

I'm now waiting for some help from the Theano people because the binary variables I use in some places seem to break the graph optimization when the target device is a GPU.

nlgranger avatar Jan 24 '18 11:01 nlgranger

Ok, the damn error is fixed now

Great! Bad luck -- I think Theano would have optimized the log-sum-exp expression by itself, but I'm not sure whether that optimization breaks depending on keepdims or similar details.

I'm now waiting for some help from the Theano people because the binary variables I use in some places seem to break the graph optimization when the target device is a GPU.

Any progress on this? Do you need some advice? If you don't need those variables for advanced indexing, you may get away with simply casting them to floatX.
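
Something along these lines (the variable names are made up for the example, not taken from the recipe):

```python
import theano
import theano.tensor as T

labels = T.imatrix('labels')
scores = T.matrix('scores')

mask = T.eq(labels, 0)                          # int8 result of the comparison
mask_f = T.cast(mask, theano.config.floatX)     # cast instead of keeping int8/bool

# the float mask can now be mixed into floating-point arithmetic directly
masked_scores = scores * (1 - mask_f) + mask_f * (-1e30)   # e.g. a "log(0)" stand-in
```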

f0k avatar Feb 19 '18 16:02 f0k

I wanted something that would remain robust even with optimizations off, especially because I was debugging it myself ;-), so the handwritten logsumexp is safer when properly implemented.
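
For reference, the handwritten version is just the usual max-shift trick, something like this (not a copy of the recipe's exact code):

```python
import theano.tensor as T

def log_sum_exp(x, axis=None):
    """Numerically stable log(sum(exp(x), axis)): shift by the maximum first."""
    x_max = T.max(x, axis=axis, keepdims=True)
    return T.log(T.sum(T.exp(x - x_max), axis=axis)) + T.max(x, axis=axis)
```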

For the optimization errors, I have opened a discussion on the theano-users mailing list, but activity is a bit low right now. It's actually not too serious because it only triggers a warning with the default .theanorc settings.

I think the CTC part is done, but for the experimental demo the results are not good. There must be an issue with the model, the parameters, or the data: some difference between this code and the paper. If somebody familiar with CTC-trained models could have a look, that would be great. Meanwhile, I will give it a try when I have some time.

nlgranger avatar Feb 19 '18 16:02 nlgranger