Training SEE on ICDAR Born Digital Dataset
I have a few queries while training SEE on the Born Digital dataset. It consists mostly of flyers and digitally created advertisements.
- How do I verify whether training is going correctly? What method did you use for this?
- Since there are multiple GTs in a single image, how do I ensure that the network correctly associates each GT with what it is detecting?
- In the `logs/` folder, the model is not generated. Do you have any idea why? I have 410 images for training and my batch size is 32.
- Related to the previous query, is a dataset of 410 images too small considering the number of parameters of the network?
@rohit12 The reason your model file is not generated is that you may not have set the `--snapshot-interval` argument. By default it is set to 20000, so a snapshot will only be generated once your script reaches a total of 20000 iterations.
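A quick back-of-the-envelope check (plain Python, not SEE code) shows why no model file appears early on with the numbers from the question above:

```python
# Rough sanity check: how long until the first snapshot is written with the
# default --snapshot-interval of 20000, 410 training images and batch size 32?
num_images = 410
batch_size = 32
snapshot_interval = 20000  # default value of --snapshot-interval

iterations_per_epoch = -(-num_images // batch_size)  # ceiling division -> 13
epochs_until_first_snapshot = snapshot_interval / iterations_per_epoch

print(f"iterations per epoch: {iterations_per_epoch}")
print(f"epochs until first snapshot: ~{epochs_until_first_snapshot:.0f}")
# -> roughly 1500 epochs before the first model file is written, so lowering
#    --snapshot-interval is usually the quickest fix on such a small dataset.
```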
Hi,
- You can verify that training is going correctly by having a look at the images located in the `bboxes` folder in your `log_dir`. Those images show the predictions of the network on a test image, and if they improve over time, training seems to be working.
- The only way to ensure this right now is to order the GT in a consistent way for each image. Otherwise you will need to find a loss that can work with random alignment. (We always forced the GT to be ordered from left to right and top to bottom; see the sketch after this list.)
- @saq1410 you are right, that is the problem here
- I think 410 images is way too small. There are too many parameters that need to be optimized, and the task is not easy at all, so it will be more than difficult for the network to learn. Your network will also overfit heavily on such a limited dataset. Maybe you can find a way to generate similar-looking data...
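Regarding the consistent GT ordering mentioned in the second point: here is a minimal sketch of one way to sort boxes into a fixed reading order (top to bottom, then left to right within a line). It is not taken from the SEE code; the box format `(x, y, width, height, text)` and the line-grouping tolerance are assumptions.

```python
def sort_ground_truth(boxes, line_tolerance=10):
    """Order GT boxes top to bottom, then left to right within each text line.

    Each box is assumed to be (x, y, width, height, text) in pixels.
    Boxes whose y coordinates differ by at most `line_tolerance` are treated
    as belonging to the same line.
    """
    boxes = sorted(boxes, key=lambda b: b[1])  # rough top-to-bottom order
    lines = []
    for box in boxes:
        if lines and abs(box[1] - lines[-1][-1][1]) <= line_tolerance:
            lines[-1].append(box)   # same text line as the previous box
        else:
            lines.append([box])     # start a new text line
    # sort each line left to right and flatten
    return [box for line in lines for box in sorted(line, key=lambda b: b[0])]


# Example: "HELLO WORLD" on the first line, "SALE" below it
gt = [(200, 12, 80, 30, "WORLD"), (20, 10, 90, 30, "HELLO"), (20, 60, 60, 30, "SALE")]
print([box[4] for box in sort_ground_truth(gt)])  # ['HELLO', 'WORLD', 'SALE']
```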
Hi Christian,
We are facing a few other problems with the Born Digital dataset.
- While creating the video, we are facing the following error:
/src/datasets/BornDigital/logs_new/2018-04-17T02:31:56.662657_training/boxes$ python3 ../../../../../see/utils/create_video.py ./ ./video.mp4
loading images
sort and cut images
creating temp file
convert -quality 100 @/tmp/tmpp5m2rc0n /tmp/tmp65e4ijjz/1000.mpeg
Killed
Traceback (most recent call last):
File "../../../../../see/utils/create_video.py", line 109, in <module>
make_video(args.image_dir, args.dest_file, batch_size=args.batch_size, start=args.start, end=args.end, pattern=args.pattern)
File "../../../../../see/utils/create_video.py", line 56, in make_video
temp_file = create_video(i, temp_file, video_dir)
File "../../../../../see/utils/create_video.py", line 92, in create_video
subprocess.run(' '.join(process_args), shell=True, check=True)
File "/usr/lib/python3.5/subprocess.py", line 708, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'convert -quality 100 @/tmp/tmpp5m2rc0n /tmp/tmp65e4ijjz/1000.mpeg' returned non-zero exit status 137
- How do we interpret the images stored in the `boxes` folder in the logs? For the Born Digital dataset, the following are a few examples from different points during training.
(attached: 1.png, 10.png, 100.png, 500.png, 1250.png)
Are these images a visualization of which region of the input image the current layer is focusing on, with the first one being the focus of the output layer?
- Do you have any suggestions for some other ground truth format? We want to look into the Google 1000 dataset, but converting it to the format that you have used in the code seems to be a rather time-consuming task.
alright:
- I think it is not working because the images could be too large (in width and/or height) to fit into a video container; exit status 137 also means the `convert` process was killed, which usually indicates it ran out of memory. You could set the keyword argument `render_extracted_rois` to `False` in the part of the code that creates the `BBOXPlotter` object (in the `train_..` file you are using). This will create smaller images. See the next bullet point for an explanation of what I mean by that.
- The images have to be interpreted in the following way:
  - the top-left image shows the input image with the predicted bboxes on it
  - all the other images in the top row show each individual region crop that has been extracted from the original input image at the location of the predicted bbox (once you set `render_extracted_rois` to `False`, these images will not be rendered anymore)
  - the bottom row shows the output of visual backprop for the specific image above it
- You can choose the groundtruth format any way you like! You will just need to create a new dataset object for that and use it instead of the ones I created. In this object you can parse your groundtruth and supply it to the network as a numpy array (a minimal parsing sketch follows this list).
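To make the last point a bit more concrete, here is a minimal sketch of such a dataset object for a hypothetical tab-separated groundtruth format. The file layout, class name and method names are made up for illustration; only the idea of parsing your own groundtruth and handing numpy arrays to the network comes from the answer above.

```python
import numpy as np


class CustomGroundTruthDataset:
    """Hypothetical dataset object for a simple tab-separated groundtruth file.

    Expected line format: the image path, followed by one "x,y,w,h,text"
    entry per word, all separated by tabs.
    """

    def __init__(self, gt_file):
        self.samples = []
        with open(gt_file) as handle:
            for line in handle:
                image_path, *word_entries = line.rstrip("\n").split("\t")
                words = []
                for entry in word_entries:
                    x, y, w, h, text = entry.split(",", 4)
                    words.append((int(x), int(y), int(w), int(h), text))
                self.samples.append((image_path, words))

    def __len__(self):
        return len(self.samples)

    def get_example(self, index):
        image_path, words = self.samples[index]
        # bounding boxes as a float32 array of shape (num_words, 4);
        # a real dataset class would typically also encode the text labels as
        # integer indices into a character map instead of raw strings
        boxes = np.array([word[:4] for word in words], dtype=np.float32)
        labels = [word[4] for word in words]
        return image_path, boxes, labels
```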
The images you posted seem to show that your network is hardly learning anything right now. I'd advise you to take a curriculum approach and start with easy samples (samples with few words) first and then increase the difficulty, otherwise it might not converge. A small sketch of such a filtering step follows below.
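One simple way to implement "easy samples first" is to order or filter the training list by the number of words per image. Again, this is only a sketch, not SEE code; it assumes samples shaped like those in the dataset sketch above (image path plus a list of word entries).

```python
def curriculum_order(samples, max_words=None):
    """Sort samples by word count (easy first); optionally drop images that
    have more words than the current curriculum stage allows."""
    ordered = sorted(samples, key=lambda sample: len(sample[1]))
    if max_words is not None:
        ordered = [sample for sample in ordered if len(sample[1]) <= max_words]
    return ordered


# Stage 1: train only on images with at most two words, then raise the limit
# in later stages, e.g. curriculum_order(dataset.samples, max_words=2).
```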