Prepared Datasets for download
I am unable to find the prepared datasets on the site that you have linked to. Do I need to prepare the datasets myself? Is it possible to get them in the format suitable for the network?
Do you want to download the SVHN datasets? If so you can find them here. If not, which dataset are you trying to find?
Thank you! This is what I was looking for. One question, though: what is the difference between the "centered" and the "easy" images? Which ones would I use for training the network myself?
One more question: I have a dataset of around 100-500 documents. I wanted to test this network on those documents. How would I go about doing this? Would it be possible for me to get the actual coordinates of the bounding boxes that the network detects?
I am a bit confused by what is mentioned in the dataset part of the README. I have a few questions about it:
For training you will need:
- the file `svhn_char_map.json` (you can find it in the folder `datasets/svhn`)
- the ground truth files of the dataset you want to use
It is mentioned that you would need the ground truth files of the dataset that you are working with. What are the ground truth files in case of the SVHN dataset? I just see the "train" and "validate" folders which contain the images. Are they considered ground truth?
Add one line to the beginning of each ground truth file: {number of house numbers in image} {max number of chars per house number} (both values need to be separated by a tab character). If you are using the grid dataset it could look like that: 4 4.
In the event that I had to prepare the SVHN dataset on my own, where would I add this? I just see the .png files. I am not really sure about this line. Could you please explain a bit more about this? Is this what the "train.csv", "test.csv" and "valid.csv" files encode?
Apologies for firing so many questions!
Thanks again, Christian!
We created two different datasets:
- `easy`: contains all images in a predefined grid (figure 5, left in the paper)
- `centered`: contains images where the SVHN numbers are placed randomly in the image, but rather near the center (figure 5, right in the paper)
Regarding your dataset: it highly depends on what your documents look like. If they are quite similar to the ones you trained the network with (meaning mainly the same classes and a similar appearance; you can not use a network trained on MNIST for SVHN and vice versa), then it should work and you might be able to use the evaluation script, although you might need to adjust it. You can extract the predicted bboxes from the network; they can be found in the generated sampling grids.
The ground truth files are the `csv` files that contain the path to an image + the labels.
Add one line to the beginning of each ground truth file: {number of house numbers in image} {max number of chars per house number} (both values need to be separated by a tab character). If you are using the grid dataset it could look like that: 4 4.
This is necessary for training under the curriculum. The line you'll need to add is some metadata about the train dataset and tells the program what the data looks like. It is used to pad data if the curriculum increases the difficulty.
So you will need to add this to each of the `csv` files. If you take a look at the `png` files in the `easy` directory, you can see that there are always 4 house numbers we want to detect and each number has a maximum length of 4 characters, thus you will need to add `4 4` (digits separated by a tab) as the first line to each of the files.
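If it helps, a rough sketch of how you could prepend that line to the files (the directory layout here is just an assumption, adjust the paths to wherever your csv files live):

```python
# Hedged sketch: prepend the "4<TAB>4" metadata line to each ground truth
# csv of the "easy" dataset. Paths and file names are assumptions.
import os

for name in ("train.csv", "valid.csv", "test.csv"):
    path = os.path.join("easy", name)  # adjust to your layout
    with open(path, "r") as f:
        content = f.read()
    with open(path, "w") as f:
        f.write("4\t4\n" + content)
```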
I hope that helps!
Oh, I see! I need to add the metadata to the start of the `png` files. What tool did you use to add the metadata? Can it be done using a library like PIL?
Also, do you have any suggestions on how I would modify this network for a document? Let's say that there are 3 text regions in the document. The number of chars in the regions might vary. Is there any other solution for this, or do I just have to do it anyway?
I would be trying to extend your approach to get better accuracy for text detection and recognition than current OCR solutions.
Oh, I may have stated that wrongly: you do not have to add this to the `png` files. You have to add this metadata as the first row of the `csv` files (`train.csv`, `valid.csv`, `test.csv`).
I do not think that you have to modify the network very much. You'll only have to set the parameters for the maximum number of text regions (that is the number of steps the localization network runs) and the maximum number of characters per time step. So if you always have three text regions, you can set this to three, and if the number of characters per text region varies, you have to find a reasonable maximum (let's say it's 23) and add the line `3 23` (separated by a tab) to the beginning of your `train.csv` file. You will also need to pad all your words with the blank label (you can choose which label this should be, but I'd recommend `0` as this is the default).
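A minimal sketch of what that padding could look like (the label values and the maximum of 23 characters are just the example numbers from above, with `0` as the blank label):

```python
# Hedged sketch: pad a variable-length label sequence to the maximum
# number of characters per text region, using 0 as the blank label.
def pad_labels(labels, max_chars=23, blank_label=0):
    return labels + [blank_label] * (max_chars - len(labels))

pad_labels([4, 7, 2])  # -> [4, 7, 2, 0, 0, ..., 0] (23 entries in total)
```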
Hi Christian,
I was able to solve the error. However, I am unable to get NCCL to run properly. It gives me the following error: Exception: NCCL is not enabled. MultiprocessParallelUpdater requires NCCL.
I have installed NCCL and reinstalled chainer and cupy.
I think I'll have to rewrite your code again unless I am able to solve this error. Do you have any suggestions for this?
Yeah, you have two options:
- Try to figure out why NCCL is not properly detected while installing cupy. You could reinstall cupy with `pip install --upgrade --force-reinstall --no-cache-dir -v cupy`, look at the output of the compilation of cupy, search for all errors regarding NCCL, and try to fix them.
- Change the code. This should also be easy, but you will lose the ability to train on multiple GPUs in parallel. You will need to exchange `MultiprocessParallelUpdater` by `StandardUpdater` (if you are using the script `train_svhn.py` this would be in this line, and you'll also need to change this line to `train_iterators = chainer.iterators.MultiprocessIterator(gpu_datasets[0], args.batch_size)`); see the sketch right after this list.
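A rough sketch of the second option (the variable names `gpu_datasets`, `args` and `optimizer` are assumptions based on the snippet above; the real names in `train_svhn.py` may differ):

```python
# Hedged sketch: single-GPU training without NCCL by using StandardUpdater
# instead of MultiprocessParallelUpdater.
train_iterators = chainer.iterators.MultiprocessIterator(
    gpu_datasets[0], args.batch_size)

updater = chainer.training.StandardUpdater(
    train_iterators, optimizer, device=args.gpu)
```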
It should work then ;)
Option 1 worked! I wonder why it did not work when I was manually installing it. However, I am now getting this error. Is there any reason for this? Or is it a problem on my end? Or have I forgotten to mention some parameters?
Traceback (most recent call last):
File "chainer/train_svhn.py", line 146, in
I think Option 1 worked for you because `pip` is a very nasty program when it comes to reinstalling stuff. If you have already installed a package and want to reinstall it with just `pip install cupy`, it won't do anything; even with the flag `--upgrade` or `--force-reinstall` it won't really do so... that is why I suggested this complex command ^^
Regarding your error: Which command line did you use to start the program?
I used the following command:
python3 chainer/train_svhn.py --char-map ./data/svhn_char_map.json -b 32 ./data/curriculum.json logs/ --blank-label 0
- `curriculum.json`: path to the csv files
- `logs/`: the logs directory
Oh, I see the problem. The code does not work on CPU (unfortunately) and you did not specify which GPU to use (i.e. `-g 0` or `--gpu 0`). If you do that it should work.
Basically the error tells us that there are not enough devices specified for the `MultiprocessIterator`.
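For instance, the command from above with the GPU flag added would look like this:
python3 chainer/train_svhn.py --char-map ./data/svhn_char_map.json -b 32 ./data/curriculum.json logs/ --blank-label 0 -g 0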
Oh! I see. Yes, I rectified that error. I thought that it would take the default GPU if there wasn't one specified.
I am getting another error.
/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:131: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size. format(optimizer.eps))
Exception in main training loop: Invalid operation is performed in: Concat (Forward)
Expect: in_types.size > 0
Actual: 0 <= 0
Traceback (most recent call last):
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
update()
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
self.update_core()
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 206, in update_core
loss = _calc_loss(self._master, batch)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in _calc_loss
return model(*in_arrays)
File "/home/rohit_shinde12194/see/chainer/utils/multi_accuracy_classifier.py", line 44, in call
self.y = self.predictor(*x)
File "/home/rohit_shinde12194/see/chainer/models/svhn.py", line 214, in call
return self.recognition_net(images, h)
File "/home/rohit_shinde12194/see/chainer/models/svhn.py", line 138, in call
final_lstm_predictions = F.concat(final_lstm_predictions, axis=0)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 90, in concat
y, = Concat(axis).apply(xs)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/function_node.py", line 230, in apply
self._check_data_type_forward(in_data)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/function_node.py", line 298, in _check_data_type_forward
self.check_type_forward(in_type)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 22, in check_type_forward
type_check.expect(in_types.size() > 0)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/utils/type_check.py", line 524, in expect
expr.expect()
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/utils/type_check.py", line 482, in expect
'{0} {1} {2}'.format(left, self.inv, right))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "chainer/train_svhn.py", line 257, in
I think that the forward pass is failing. However, I am not sure what is causing it to fail.
I am working on a couple of methods for end-to-end text detection and recognition which would work well on documents. I think your STN-OCR and SEE are among the few papers on this topic.
I have some questions and thoughts on this. Do reply as you see fit!
- I am thinking of taking a generic text detector and attaching its output to another text recognizer and then training them end-to-end. However, there needs to be a differentiable connection between them. Do you have any idea how this might be done?
- Why did you use a ResNet? In the paper, you argue that gradient information needs to be strong in order to train the network well. I wanted to swap the ResNet out for a CRNN and do some experiments as well. It might lead to some different insights.
- I am checking out your code. In the `chainer.models` directory, there are the following files: `fsns.py`, `fsns_resnet.py`, `ic_stn.py`, `svhn.py`, `text_recognition.py`. If I want to swap out the ResNet for a different recognition net, I would do it in `text_recognition.py`, right?
- In another issue, you mention that Chainer is much easier to use than Tensorflow. I find Tensorflow code easier to understand given that they now use Estimators. That is a matter of personal taste, however. But is it possible to code the same thing in Tensorflow as well? I think other than the Spatial Transformer Grid, you can get almost everything in Tensorflow. Am I missing something?
- And one final question! Thanks for bearing with me till here. Do you have any suggestions for improvement that I could look at? Something that you missed or did not think would work? I am working on my Masters Thesis and a part of that involves running a bunch of experiments on end-to-end text detectors and recognizers. My personal motive is to improve the benchmarks on text documents, so I am looking to accomplish that.
There seems to be something wrong with your arrays. The datatype is not correct, you should check that.
Regarding your questions:
- STN-OCR and SEE are not the only methods for end-to-end detection and recognition. In the last two months I also saw other methods. But the unique property of STN-OCR and SEE is that those systems do not need any groundtruth for the localization of text regions
- I also had the same problem and decided to use a Spatial Transformer Network and crop the localized text regions from the image using the differentiable image sampler (see the short sketch after this list). So far this is the best way I was able to figure out.
- Well you can substitute the ResNet by a CRNN, but as I said in the paper: I highly recommend using a ResNet like feature extractor because of the strong gradient information that is retained by the ResNet. So if you go for CRNN, use everything that makes a CRNN a CRNN but exchange the convolutional feature extractor by a ResNet like structure and you should be good to go
- That depends on the task you want to achieve. The names of the files are misleading (I apologize for that). Everything that has to do with SVHN experiments can be found in `svhn.py` (that includes the detection and the recognition net for the SVHN experiments), everything that has to do with FSNS can be found in `fsns.py`, and so on.
- You can code everything in Tensorflow, too. Even the Spatial Transformer Grid works there. If you want to do that it should be possible.
- The most important thing that needs to be improved is the training of the localization network and also the maximum number of localizations. Right now this is too constrained and difficult to learn. A different approach is necessary. From my point of view, this is one of the most important things.
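Regarding the differentiable cropping mentioned in the second point, a minimal sketch using Chainer's spatial transformer functions could look like this (the image shape and the affine parameters are made up for illustration, and the exact behaviour may depend on your Chainer version):

```python
# Hedged sketch: crop a region from an image in a differentiable way
# with a spatial transformer grid + sampler.
import numpy as np
import chainer.functions as F

images = np.random.rand(1, 3, 64, 128).astype(np.float32)  # (B, C, H, W)

# 2x3 affine matrix, normally predicted by a localization network
# (hard-coded here just for the example).
theta = np.array([[[0.5, 0.0, -0.25],
                   [0.0, 0.5,  0.0 ]]], dtype=np.float32)

grid = F.spatial_transformer_grid(theta, (32, 64))   # sampling grid
crops = F.spatial_transformer_sampler(images, grid)  # differentiable crop
```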
Thanks for the reply, Christian!
How would I go about checking the arrays? I am a bit stumped here, so if you could give me a starting point, I could start my debugging from there.
This call
File "/home/rohit_shinde12194/see/chainer/models/svhn.py", line 138, in __call__: final_lstm_predictions = F.concat(final_lstm_predictions, axis=0)
is the cause of the problem. So just set a breakpoint in this line and have a look at the list `final_lstm_predictions`; there should be something wrong with this array.
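If it helps, one minimal way to do that is to drop into the debugger directly above the failing line from the traceback:

```python
# Hedged sketch: break right before the failing concat in
# chainer/models/svhn.py to inspect final_lstm_predictions
# (e.g. its length and the shapes of its elements).
import pdb; pdb.set_trace()
final_lstm_predictions = F.concat(final_lstm_predictions, axis=0)
```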
I did not get time to look at this yesterday, so I'll be looking at it today. What is a good way to understand your code? Where is the very first layer of the network? I am assuming that this line is where the model takes in input data. Am I right? I plan on actually going step by step through the models to understand what is going on underneath.
Yes, you are right. This is the line where the model takes the input data and does the forward pass through the network.
It might be a good idea to use the Python debugger (`pdb`) or the IPython debugger (`ipdb`) to step through the code and see what is happening where. You should also have a look at the Chainer documentation and learn how a network is built using this library ;)
I finally got to debugging it. This line fails to go into the loop because `num_labels` is 0. When I checked why, I found out that this line gets the number of labels. That command returns a 0. I am not really sure why. Do you have any idea why this happens?
I'd say that something in your train csv is wrong. Could you paste the first 3 lines of the label file you are using for training?
I have pasted them below:
4 4
train/0.png 4 7
train/1.png 5 1
train/2.png 3 3
train/3.png 6 8
That does not look right... In the first line you are stating that you want to localize a maximum of 4 text lines per image (first 4) and that each text line has a maximum of 4 characters (second 4). But the next lines only have 2 labels each. That is not enough. You should have something like:
4 4
train/0.png 4 10 10 10 7 10 10 10 10 10 10 10 10 10 10 10
train/1.png 5 10 10 10 1 10 10 10 10 10 10 10 10 10 10 10
...
This assumes that `10` is your blank label.
So make sure to pad your data with the blank label.
For instance, the line
train/0.png 4 10 10 10 7 10 10 10 10 10 10 10 10 10 10 10
says that you have 2 text lines: the first text line has one character (class 4), the second line has one character (class 7), and the next two lines are not there (blank label).
Oh! Now I get it. I was misunderstanding it all the while. One more question: I am using the `train.csv` that was present in the SVHN dataset that you linked to. I thought it would be in the correct format. Wouldn't that `train.csv` work right out of the box?
Your code is running when I use the `train.csv` file present in the `easy` folder. That is the correct format.
One last question: why does your method need to have the number of lines and the number of characters as ground truth? Why can't it infer them? Why can't you let the LSTM run for around 50 timesteps or so? I have noticed that other methods like CTPN or CRNN don't need them.
Nice that you got it! I'm sorry, it is very difficult to explain properly :sweat_smile:...
The `train.csv` file in the `centered` folder is already in the correct format; only the first line of metadata is missing. This line should be `1 2`, because there is always one line of text in the image, and each line of text contains a maximum of 2 characters.
That's fine. I know how difficult these things can be to explain. But do you mind explaining why you need to have the number of lines and the number of characters as ground truth?
Sure,
I need the maximum number of lines and the maximum number of characters as ground truth for the following reasons:
- I need to know in advance how many timesteps I have to run my LSTM, because the network has, as of yet, no way to know when it should stop.
- Knowing this makes it very easy to create batches, as you always know how many items will be in your batch. This also prevents memory exhaustion in the middle of the training, as the memory consumption is constant over every training iteration.
- It is necessary to provide this information in the `csv` file, because training happens under a curriculum (the `json` file you created describes this curriculum).
Does that make sense to you?
- Yes, I figured that it would be needed for the LSTM. However, other methods for text recognition don't really require the number of timesteps as input (I may be wrong on this, but I did not notice it). That's why I had this question.
That is my main confusion actually.
You do not need the number of timesteps as input; you could also do everything without that information in the ground truth file. It is just necessary to put it into the ground truth file because of the way I implemented Curriculum Learning. This metadata is necessary in order to successfully train the model under the curriculum regime. Without the curriculum, there would be no need for this data in the ground truth file.