ctc_tensorflow_example
OCR: clarification about input and output
I'm trying to solve OCR tasks based on this code.
So what shape should the input to the LSTM have? Suppose we have images of shape [batch_size, height, width, channels]: how should they be reshaped to be used as input? Like [batch_size, width, height*channels], so that width acts as the time dimension?
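A minimal sketch of that reshape (assuming TF 1.x as in this repo, and made-up fixed sizes of 32x100 grayscale images, which are just placeholders):

```python
import tensorflow as tf

# [batch_size, height, width, channels] -> [batch_size, width, height*channels]
images = tf.placeholder(tf.float32, [None, 32, 100, 1])   # hypothetical fixed-size inputs
seq = tf.transpose(images, [0, 2, 1, 3])                   # [batch, width, height, channels]
inputs = tf.reshape(seq, [-1, 100, 32 * 1])                # [batch, max_time=width, num_features=height*channels]
```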
What if I want to have variable width? As I understand it, the sequences in a batch must all have the same length (is the common trick just to pad each sequence with zeros at the end, or should batch_size be 1?).
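A hedged NumPy sketch of the zero-padding trick (the helper name and shapes are my own, not from the repo): pad every sequence in the batch to the longest width and keep the true lengths to feed as seq_len.

```python
import numpy as np

def pad_batch(sequences):
    """sequences: list of arrays of shape [width_i, num_features] (variable width_i)."""
    lengths = np.asarray([s.shape[0] for s in sequences], dtype=np.int32)
    max_len = lengths.max()
    num_features = sequences[0].shape[1]
    padded = np.zeros((len(sequences), max_len, num_features), dtype=np.float32)
    for i, s in enumerate(sequences):
        padded[i, :s.shape[0], :] = s      # zero-pad at the end of the time axis
    return padded, lengths                  # lengths go to seq_len for dynamic_rnn / ctc_loss
```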
What if I want to have variable width and height? As I understand it, I need to use convolutional + global average pooling / spatial pyramid pooling layers before the LSTM input, so the output blob will be [batch_size, feature_map_height, feature_map_width, feature_map_channels]. How should that blob be reshaped to be used as input to the LSTM? Like [batch_size, feature_map_width, feature_map_height*feature_map_channels]? Can we reshape it to a single row like [batch_size, feature_map_width*feature_map_height*feature_map_channels]? It would then be like a sequence of pixels and we lose some spatial information; will it work?
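A rough sketch of the first option (the feature-map sizes are my own assumption, not from the repo): keep feature_map_width as the time axis and fold the other two axes into the feature dimension.

```python
import tensorflow as tf

feature_maps = tf.placeholder(tf.float32, [None, 8, 25, 64])   # [batch, fm_height, fm_width, fm_channels]
x = tf.transpose(feature_maps, [0, 2, 1, 3])                    # [batch, fm_width, fm_height, fm_channels]
rnn_inputs = tf.reshape(x, [-1, 25, 8 * 64])                    # [batch, time=fm_width, features=fm_height*fm_channels]
```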
Here is the definition of the input, but I'm not sure what [batch_size, max_stepsize, num_features] means in your case:
https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L90
And how does the output of the LSTM depend on the input size and the max sequence length? https://github.com/igormq/ctc_tensorflow_example/blob/master/ctc_tensorflow_example.py#L110
BTW: here are some examples using 'standard' approaches in Keras+TensorFlow which I want to complement with RNN examples: https://github.com/mrgloom/Char-sequence-recognition
Seems related: https://gist.github.com/igormq/eff5b2196a52e89c61ea52515ed87c47
Some info is described here, but it's still not very clear to me:
https://stackoverflow.com/questions/38059247/using-tensorflows-connectionist-temporal-classification-ctc-implementation
So, at the input of the RNN we have something like [num_batch, max_time_step, num_features]. We use dynamic_rnn to perform the recurrent calculations given the input, outputting a tensor of shape [num_batch, max_time_step, num_hidden]. After that, we need to do an affine projection at each time step with weight sharing, so we have to reshape to [num_batch*max_time_step, num_hidden], multiply by a weight matrix of shape [num_hidden, num_classes], add a bias, undo the reshape, and transpose (so we will have [max_time_steps, num_batch, num_classes] for the ctc loss input), and this result will be the input of the ctc_loss function.
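A condensed sketch of that pipeline (TF 1.x style, following the linked example; the sizes here are placeholders):

```python
import tensorflow as tf

num_features, num_hidden, num_classes = 13, 100, 28              # placeholder sizes

inputs = tf.placeholder(tf.float32, [None, None, num_features])  # [num_batch, max_time_step, num_features]
targets = tf.sparse_placeholder(tf.int32)                        # sparse labels for ctc_loss
seq_len = tf.placeholder(tf.int32, [None])                       # true length of each sequence

cell = tf.contrib.rnn.LSTMCell(num_hidden, state_is_tuple=True)
outputs, _ = tf.nn.dynamic_rnn(cell, inputs, seq_len, dtype=tf.float32)  # [num_batch, max_time_step, num_hidden]

batch_s = tf.shape(inputs)[0]
outputs = tf.reshape(outputs, [-1, num_hidden])                  # [num_batch*max_time_step, num_hidden]

W = tf.Variable(tf.truncated_normal([num_hidden, num_classes], stddev=0.1))
b = tf.Variable(tf.constant(0., shape=[num_classes]))

logits = tf.matmul(outputs, W) + b                               # affine projection, weights shared across time
logits = tf.reshape(logits, [batch_s, -1, num_classes])          # undo the reshape
logits = tf.transpose(logits, (1, 0, 2))                         # [max_time_steps, num_batch, num_classes]

loss = tf.reduce_mean(tf.nn.ctc_loss(targets, logits, seq_len))  # ctc_loss expects time-major logits
```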
Hi @mrgloom, you can use either width or height as your "time dimension". Using the width you will perform a row-wise scan; otherwise you will perform a column-wise scan. Also, you can apply conv layers before the LSTM network followed by a Global Average Pooling, returning a tensor with shape [batch_size, feature_map_height, feature_map_width].
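One reading of that suggestion, as a hedged sketch (the conv parameters and image size are arbitrary choices of mine): average over the channel axis so each image becomes a [feature_map_height, feature_map_width] map, then pick either remaining axis as time.

```python
import tensorflow as tf

images = tf.placeholder(tf.float32, [None, 32, 100, 1])
conv = tf.layers.conv2d(images, filters=64, kernel_size=3,
                        padding='same', activation=tf.nn.relu)
pooled = tf.reduce_mean(conv, axis=3)             # [batch_size, feature_map_height, feature_map_width]
rnn_inputs = tf.transpose(pooled, [0, 2, 1])      # width as time, i.e. the row-wise scan mentioned above
```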