Prepared Datasets for download
I am unable to find the prepared datasets on the site that you have linked to. Do I need to prepare the datasets myself? Is it possible to get them in the format suitable for the network?
Do you want to download the SVHN datasets? If so you can find them here. If not, which dataset are you trying to find?
Thank you! This is what I was looking for. One question, though: what is the difference between the "centered" and the "easy" images? Which ones would I use for training the network myself?
One more question: I have a dataset of around 100-500 documents. I wanted to test this network on those documents. How would I go about doing this? Would it be possible for me to get the actual coordinates of the bounding boxes that the network detects?
I am a bit confused by what is mentioned in the dataset part of the README. I have a few questions about it:
For training you will need:
- the file `svhn_char_map.json` (you can find it in the folder `datasets/svhn`)
- the ground truth files of the dataset you want to use
It is mentioned that you would need the ground truth files of the dataset that you are working with. What are the ground truth files in case of the SVHN dataset? I just see the "train" and "validate" folders which contain the images. Are they considered ground truth?
Add one line to the beginning of each ground truth file: {number of house numbers in image} {max number of chars per house number} (both values need to be separated by a tab character). If you are using the grid dataset it could look like that: 4 4.
In the event that I had to prepare the SVHN dataset on my own, where would I add this? I just see the .png files. I am not really sure about this line. Could you please explain a bit more about this? Is this what the "train.csv", "test.csv" and "valid.csv" files encode?
Apologies for firing so many questions!
Thanks again, Christian!
We created two different datasets:
- `easy`: contains all images in a predefined grid (figure 5, left in the paper)
- `centered`: contains images where the SVHN numbers are placed randomly in the image, but rather near the center (figure 5, right in the paper)
Regarding your dataset: it highly depends on what your documents look like. If they are quite similar to the ones you trained the network with (meaning mainly the same classes and a similar appearance; you can not use a network trained on MNIST for SVHN and vice versa), then it should work and you might be able to use the evaluation script, although you might need to adjust it. You can extract the predicted bboxes from the network; they can be found in the generated sampling grids.
The ground truth files are the `csv` files that contain the path to an image + the labels.
Add one line to the beginning of each ground truth file: {number of house numbers in image} {max number of chars per house number} (both values need to be separated by a tab character). If you are using the grid dataset it could look like that: 4 4.
This is necessary for training under the curriculum. The line you'll need to add is some metadata about the train dataset and tells the program what the data looks like. It is used to pad data if the curriculum increases the difficulty.
So you will need to add this to each of the `csv` files. If you take a look at the `png` files in the `easy` directory, you can see that there are always 4 house numbers we want to detect and each number has a maximum length of 4 characters, thus you will need to add `4 4` (digits separated by a tab) as the first line to each of the files.
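If it helps, a rough sketch of how you could prepend that line to the files (the directory layout here is just an assumption, adjust the paths to wherever your csv files live):

```python
# Hedged sketch: prepend the "4<TAB>4" metadata line to each ground truth
# csv of the "easy" dataset. Paths and file names are assumptions.
import os

for name in ("train.csv", "valid.csv", "test.csv"):
    path = os.path.join("easy", name)  # adjust to your layout
    with open(path, "r") as f:
        content = f.read()
    with open(path, "w") as f:
        f.write("4\t4\n" + content)
```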
I hope that helps!
Oh, I see! I need to add the metadata to the start of the `png` files. What tool did you use to add the metadata? Can it be done using a library like PIL?
Also, do you have any suggestions on how I would modify this network for a document? Let's say that there are 3 text regions in the document. The number of chars in the regions might vary. Is there any other solution for this, or do I just have to do it anyway?
I would be trying to extend your approach to get better accuracy for text detection and recognition than current OCR solutions.
Oh, I may have stated that wrongly: you do not have to add this to the `png` files. You have to add this metadata as the first row of the `csv` files (`train.csv`, `valid.csv`, `test.csv`).
I do not think that you have to modify the network very much. You'll only have to set the parameters for the maximum number of text regions (that is the number of steps the localization network runs) and the maximum number of characters per time step. So if you always have three text regions, you can set this to three, and if the number of characters per text region varies, you have to find a reasonable maximum (let's say it's 23) and add the line `3 23` (separated by a tab) to the beginning of your `train.csv` file. You will also need to pad all your words with the blank label (you can choose which label this should be, but I'd recommend `0` as this is the default).
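A minimal sketch of what that padding could look like (the label values and the maximum of 23 characters are just the example numbers from above, with `0` as the blank label):

```python
# Hedged sketch: pad a variable-length label sequence to the maximum
# number of characters per text region, using 0 as the blank label.
def pad_labels(labels, max_chars=23, blank_label=0):
    return labels + [blank_label] * (max_chars - len(labels))

pad_labels([4, 7, 2])  # -> [4, 7, 2, 0, 0, ..., 0] (23 entries in total)
```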
Hi Christian,
I was able to solve the error. However, I am unable to get NCCL to run properly. It gives me the following error: Exception: NCCL is not enabled. MultiprocessParallelUpdater requires NCCL.
I have installed NCCL and reinstalled chainer and cupy.
I think I'll have to rewrite your code again unless I am able to solve this error. Do you have any suggestions for this?
Yeah, you have two options:
- Try to figure out why NCCL is not properly detected while installing cupy. You could reinstall cupy with `pip install --upgrade --force-reinstall --no-cache-dir -v cupy`, look at the output of the compilation of cupy, search for all errors regarding NCCL, and try to fix them.
- Change the code. This should also be easy, but you will lose the ability to train on multiple GPUs in parallel. You will need to exchange `MultiprocessParallelUpdater` by `StandardUpdater` (if you are using the script `train_svhn.py` this would be in this line, and you'll also need to change this line to `train_iterators = chainer.iterators.MultiprocessIterator(gpu_datasets[0], args.batch_size)`); see the sketch right after this list.
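A rough sketch of the second option (the variable names `gpu_datasets`, `args` and `optimizer` are assumptions based on the snippet above; the real names in `train_svhn.py` may differ):

```python
# Hedged sketch: single-GPU training without NCCL by using StandardUpdater
# instead of MultiprocessParallelUpdater.
train_iterators = chainer.iterators.MultiprocessIterator(
    gpu_datasets[0], args.batch_size)

updater = chainer.training.StandardUpdater(
    train_iterators, optimizer, device=args.gpu)
```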
It should work then ;)
Option 1 worked! I wonder why it did not work when I was manually installing it. However, I am now getting this error. Is there any reason for this? Or is it a problem on my end? Or have I forgotten to mention some parameters?
Traceback (most recent call last):
File "chainer/train_svhn.py", line 146, in
I think Option 1 worked for you because `pip` is a very nasty program when it comes to reinstalling stuff. If you have already installed a package and want to reinstall it with just `pip install cupy`, it won't do anything; even with the flag `--upgrade` or `--force-reinstall` it won't really do so... that is why I suggested this complex command ^^
Regarding your error: Which command line did you use to start the program?
I used the following command:
python3 chainer/train_svhn.py --char-map ./data/svhn_char_map.json -b 32 ./data/curriculum.json logs/ --blank-label 0
- `curriculum.json`: path to the csv files
- `logs/`: the logs directory
Oh, I see the problem. The code does not work on CPU (unfortunately) and you did not specify which GPU to use (i.e. `-g 0` or `--gpu 0`). If you do that it should work.
Basically the error tells us that there are not enough devices specified for the `MultiprocessIterator`.
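For instance, the command from above with the GPU flag added would look like this:
python3 chainer/train_svhn.py --char-map ./data/svhn_char_map.json -b 32 ./data/curriculum.json logs/ --blank-label 0 -g 0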
Oh! I see. Yes, I rectified that error. I thought that it would take the default GPU if there wasn't one specified.
I am getting another error.
/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py:131: UserWarning: optimizer.eps is changed to 1e-08 by MultiprocessParallelUpdater for new batch size. format(optimizer.eps))
Exception in main training loop: Invalid operation is performed in: Concat (Forward)
Expect: in_types.size > 0
Actual: 0 <= 0
Traceback (most recent call last):
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/trainer.py", line 299, in run
update()
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/updater.py", line 223, in update
self.update_core()
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 206, in update_core
loss = _calc_loss(self._master, batch)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/training/updaters/multiprocess_parallel_updater.py", line 235, in _calc_loss
return model(*in_arrays)
File "/home/rohit_shinde12194/see/chainer/utils/multi_accuracy_classifier.py", line 44, in call
self.y = self.predictor(*x)
File "/home/rohit_shinde12194/see/chainer/models/svhn.py", line 214, in call
return self.recognition_net(images, h)
File "/home/rohit_shinde12194/see/chainer/models/svhn.py", line 138, in call
final_lstm_predictions = F.concat(final_lstm_predictions, axis=0)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 90, in concat
y, = Concat(axis).apply(xs)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/function_node.py", line 230, in apply
self._check_data_type_forward(in_data)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/function_node.py", line 298, in _check_data_type_forward
self.check_type_forward(in_type)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/functions/array/concat.py", line 22, in check_type_forward
type_check.expect(in_types.size() > 0)
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/utils/type_check.py", line 524, in expect
expr.expect()
File "/home/rohit_shinde12194/.local/lib/python3.5/site-packages/chainer/utils/type_check.py", line 482, in expect
'{0} {1} {2}'.format(left, self.inv, right))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "chainer/train_svhn.py", line 257, in
I think that the forward pass is failing. However, I am not sure what is causing it to fail.
I am working on a couple of methods for end-to-end text detection and recognition which would work well on documents. I think your STN-OCR and SEE are among the few papers on this topic.
I have some questions and thoughts on this. Do reply as you see fit!
- I am thinking of taking a generic text detector and attaching its output to another text recognizer and then training them end-to-end. However, there needs to be a differentiable connection between them. Do you have any idea how this might be done?
- Why did you use a ResNet? In the paper, you argue that gradient information needs to be strong in order to train the network well. I wanted to swap the ResNet out for a CRNN and do some experiments as well. It might lead to some different insights.
- I am checking out your code. In the `chainer.models` directory, there are the following files: `fsns.py`, `fsns_resnet.py`, `ic_stn.py`, `svhn.py`, `text_recognition.py`. If I want to swap out the ResNet for a different recognition net, I would do it in `text_recognition.py`, right?
- In another issue, you mention that Chainer is much easier to use than Tensorflow. I find Tensorflow code easier to understand given that they now use Estimators. That is a matter of personal taste, however. But is it possible to code the same thing in Tensorflow as well? I think other than the Spatial Transformer Grid, you can get almost everything in Tensorflow. Am I missing something?
- And one final question! Thanks for bearing with me till here. Do you have any suggestions for improvement that I could look at? Something that you missed or did not think would work? I am working on my Masters Thesis and a part of that involves running a bunch of experiments on end-to-end text detectors and recognizers. My personal motive is to improve the benchmarks on text documents, so I am looking to accomplish that.
There seems to be something wrong with your arrays. The datatype is not correct, you should check that.
Regarding your questions:
- STN-OCR and SEE are not the only methods for end-to-end detection and recognition. In the last two months I also saw other methods. But the unique property of STN-OCR and SEE is that those systems do not need any groundtruth for the localization of text regions
- I also had the same problem and decided to use a Spatial Transformer Network and crop the localized text regions from the image using the differentiable image sampler (see the short sketch after this list). So far this is the best way I was able to figure out.
- Well you can substitute the ResNet by a CRNN, but as I said in the paper: I highly recommend using a ResNet like feature extractor because of the strong gradient information that is retained by the ResNet. So if you go for CRNN, use everything that makes a CRNN a CRNN but exchange the convolutional feature extractor by a ResNet like structure and you should be good to go
- That depends on the task you want to achieve. The names of the files are misleading (I apologize for that). Everything that has to do with SVHN experiments can be found in `svhn.py` (that includes the detection and the recognition net for the SVHN experiments), everything that has to do with FSNS can be found in `fsns.py`, and so on.
- You can code everything in Tensorflow, too. Even the Spatial Transformer Grid works there. If you want to do that it should be possible.
- The most important thing that needs to be improved is the training of the localization network and also the maximum number of localizations. Right now this is too constrained and difficult to learn. A different approach is necessary. From my point of view, this is one of the most important things.
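Regarding the differentiable cropping mentioned in the second point, a minimal sketch using Chainer's spatial transformer functions could look like this (the image shape and the affine parameters are made up for illustration, and the exact behaviour may depend on your Chainer version):

```python
# Hedged sketch: crop a region from an image in a differentiable way
# with a spatial transformer grid + sampler.
import numpy as np
import chainer.functions as F

images = np.random.rand(1, 3, 64, 128).astype(np.float32)  # (B, C, H, W)

# 2x3 affine matrix, normally predicted by a localization network
# (hard-coded here just for the example).
theta = np.array([[[0.5, 0.0, -0.25],
                   [0.0, 0.5,  0.0 ]]], dtype=np.float32)

grid = F.spatial_transformer_grid(theta, (32, 64))   # sampling grid
crops = F.spatial_transformer_sampler(images, grid)  # differentiable crop
```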
Thanks for the reply, Christian!
How would I go about checking the arrays? I am a bit stumped here, so if you could give me a starting point, I could start my debugging from there.
This call
File "/home/rohit_shinde12194/see/chainer/models/svhn.py", line 138, in __call__: final_lstm_predictions = F.concat(final_lstm_predictions, axis=0)
is the cause of the problem. So just set a breakpoint in this line and have a look at the list `final_lstm_predictions`; there should be something wrong with this array.
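If it helps, one minimal way to do that is to drop into the debugger directly above the failing line from the traceback:

```python
# Hedged sketch: break right before the failing concat in
# chainer/models/svhn.py to inspect final_lstm_predictions
# (e.g. its length and the shapes of its elements).
import pdb; pdb.set_trace()
final_lstm_predictions = F.concat(final_lstm_predictions, axis=0)
```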
I did not get time to look at this yesterday, so I'll be looking at it today. What is a good way to understand your code? Where is the very first layer of the network? I am assuming that this line is where the model takes in input data. Am I right? I plan on actually going step by step through the models to understand what is going on underneath.
Yes, you are right. This is the line where the model takes the input data and does the forward pass through the network.
It might be a good idea to use the Python debugger (`pdb`) or the IPython debugger (`ipdb`) to step through the code and see what is happening where. You should also have a look at the Chainer documentation and learn how a network is built using this library ;)
I finally got to debugging it. This line fails to go into the loop because `num_labels` is 0. When I checked why, I found out that this line gets the number of labels. That command returns a 0. I am not really sure why. Do you have any idea why this happens?
I'd say that something in your train csv is wrong. Could you paste the first 3 lines of the label file you are using for training?
I have pasted them below:
4 4
train/0.png 4 7
train/1.png 5 1
train/2.png 3 3
train/3.png 6 8
That does not look right... In the first line you are stating that you want to localize a maximum of 4 text lines per image (first 4) and that each text line has a maximum of 4 characters (second 4). But the next lines only have 2 labels each. That is not enough. You should have something like:
4 4
train/0.png 4 10 10 10 7 10 10 10 10 10 10 10 10 10 10 10
train/1.png 5 10 10 10 1 10 10 10 10 10 10 10 10 10 10 10
...
This assumes that `10` is your blank label.
So make sure to pad your data with the blank label.
For instance, the line
train/0.png 4 10 10 10 7 10 10 10 10 10 10 10 10 10 10 10
says that you have 2 text lines: the first text line has one character (class 4), the second line has one character (class 7), and the next two lines are not there (blank label).
Oh! Now I get it. I was misunderstanding it all the while. One more question: I am using the `train.csv` that was present in the SVHN dataset that you linked to. I thought it would be in the correct format. Wouldn't that `train.csv` work right out of the box?
Your code is running when I use the `train.csv` file present in the `easy` folder. That is the correct format.
One last question: why does your method need to have the number of lines and the number of characters as ground truth? Why can't it infer them? Why can't you let the LSTM run for around 50 timesteps or so? I have noticed that other methods like CTPN or CRNN don't need them.
Nice that you got it! I'm sorry, it is very difficult to explain properly :sweat_smile:...
The `train.csv` file in the `centered` folder is already in the correct format; only the first line of metadata is missing. This line should be `1 2`, because there is always one line of text in the image, and each line of text contains a maximum of 2 characters.
That's fine. I know how difficult these things can be to explain. But do you mind explaining why you need to have the number of lines and the number of characters as ground truth?
Sure,
I need the maximum number of lines and the maximum number of characters as ground truth for the following reasons:
- I need to know in advance how many timesteps I have to run my LSTM, because the network has, as of yet, no way to know when it should stop.
- Knowing this makes it very easy to create batches, as you always know how many items will be in your batch. This also prevents memory exhaustion in the middle of the training, as the memory consumption is constant over every training iteration.
- It is necessary to provide this information in the `csv` file, because training happens under a curriculum (the `json` file you created describes this curriculum).
Does that make sense to you?
- Yes, I figured that it would be needed for the LSTM. However, other methods for text recognition don't really require the number of timesteps as input (I may be wrong on this, but I did not notice it). That's why I had this question.
That is my main confusion actually.
You do not need the number of timesteps as input; you could also do everything without that information in the ground truth file. It is just necessary to put it into the ground truth file because of the way I implemented Curriculum Learning. This metadata is necessary in order to successfully train the model under the curriculum regime. Without the curriculum, there would be no need for this data in the ground truth file.