deep-text-recognition-benchmark

How did you create the lmdb of MJSynth?

SaeedArisha opened this issue on Dec 17, 2020 • 7 comments

Hi, I want to know how you converted the MJSynth dataset to lmdb using the annotation files it ships with.

SaeedArisha avatar Dec 17 '20 12:12 SaeedArisha

Managed it myself, if anyone needs help let me know

SaeedArisha avatar Dec 18 '20 03:12 SaeedArisha


Hi, I would like to know how you converted the MJSynth dataset to lmdb format. I am not sure how to do it, as the tar.gz file is too big to explore, and there are over 4,000 folders, each containing subfolders of images. I am also not sure how to handle the path issues. Kindly provide a step-by-step solution with code here so that others can benefit. Thanks a lot in advance 👍

ysee007 avatar Dec 22 '20 16:12 ysee007


In order to convert to lmdb, you will have to unzip the archive; inside you will find four text files, which are the annotation files. Since I was using the create_lmdb_dataset.py script that the authors of this repository provide, I had to convert the original annotation files into the format it expects, i.e. imagepath\tlabel\n. It's very basic:

```python
# Rewrite an MJSynth annotation file into the imagepath\tlabel\n format.
# The word label is embedded in the image filename between underscores.
fp = open("annotation.txt")
lines = fp.readlines()
fp.close()

with open('gt.txt', 'w') as f:
    for i in lines:
        a = i.split('/')   # path components, e.g. ['.', '<dir>', '<subdir>', '<filename> <lexicon idx>']
        b = a[3]           # filename part (with the trailing lexicon index)
        c = b.split('_')   # c[1] is the ground-truth word
        d = i.split(' ')   # d[0] is the relative image path
        f.writelines(d[0] + '\t' + c[1] + '\n')
```
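For context, each line in the MJSynth annotation files is a relative image path followed by a lexicon index, and the word itself sits in the middle of the filename. A quick illustration of what the script above extracts, using a made-up line in that pattern (the path is hypothetical):

```python
# Hypothetical annotation.txt line following the MJSynth naming pattern.
line = "./2697/6/466_MONIKER_49537.jpg 49537\n"

path = line.split(' ')[0]                 # "./2697/6/466_MONIKER_49537.jpg"
label = line.split('/')[3].split('_')[1]  # "MONIKER"
print(path + '\t' + label)                # one gt.txt line: imagepath<TAB>label
```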

So what I do is read the annotations from annotation.txt and store them in gt.txt. Next, all you have to do is follow the method the author of this repo has provided and convert the dataset to lmdb. Store the new ground-truth file inside the 90kDICT32px folder, where all the 4,000 image folders reside, so that the relative image paths stay reachable.
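For anyone following along, the conversion step itself is just the repo's create_lmdb_dataset.py. A sketch of the invocation, with placeholder paths for wherever you unpacked MJSynth and wherever you want the output to land:

```bash
pip3 install fire lmdb

# inputPath is the folder the relative paths in gt.txt are resolved against,
# so point it at the 90kDICT32px directory that also contains gt.txt.
python3 create_lmdb_dataset.py \
    --inputPath /path/to/90kDICT32px \
    --gtFile /path/to/90kDICT32px/gt.txt \
    --outputPath data_lmdb_release/training/MJ
```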

Let me know if you have further questions, so I can close this.

SaeedArisha avatar Dec 28 '20 13:12 SaeedArisha


You can proceed to close this, thank you for the help!

ysee007 avatar Dec 28 '20 15:12 ysee007

If I'm using my own dataset to train the model, do I still need the MJSynth dataset? If not, what do I need to use for --select_data?

rouarouatbi avatar Aug 28 '21 14:08 rouarouatbi


No, you don't have to use the MJSynth dataset then. When using your own dataset, pass '/' for the --select_data flag, which indicates that you'll be using all the data inside the train and test folders you have provided.
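As a sketch (the lmdb paths are placeholders, and the other flags simply mirror the command used elsewhere in this thread), a training run on a custom dataset would look something like:

```bash
CUDA_VISIBLE_DEVICES=0 python3 train.py \
    --train_data path/to/my_lmdb/training --valid_data path/to/my_lmdb/validation \
    --select_data "/" --batch_ratio 1 \
    --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
```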

SaeedArisha avatar Aug 29 '21 09:08 SaeedArisha


Hello, I'm also using my own dataset, but I am still facing issues. I successfully created the datasets in the training and validation folders using create_lmdb_dataset.py, which gave two files: data.mdb and lock.mdb. However, when running:

```
python3 train.py --train_data data_lmdb_release/training --valid_data data_lmdb_release/validation --batch_ratio 1 --Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn
```

I get this error:

```
File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py", line 103, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
```

I have seen all the previous issues concerning this error, but none seem to work. If anyone could help I would be very grateful. Thanks in advance.

abo-jaafar avatar Oct 05 '21 13:10 abo-jaafar
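A debugging note on the error above: num_samples=0 means the dataset loader ended up with zero usable samples. A minimal sanity check, assuming the lmdb was written with create_lmdb_dataset.py and lives at the placeholder path below, is to read back the num-samples key it stores:

```python
# Read the 'num-samples' count that create_lmdb_dataset.py writes into the lmdb.
# The path is a placeholder; point it at the folder containing data.mdb/lock.mdb.
import lmdb

env = lmdb.open('data_lmdb_release/training/MY_DATA', readonly=True, lock=False)
with env.begin(write=False) as txn:
    n = txn.get('num-samples'.encode())
    print('num-samples:', int(n) if n is not None else 'key missing')
```

If this prints a positive count but train.py still sees zero samples, the usual suspects are the folder layout (train.py looks inside --train_data for sub-folders matching --select_data, which defaults to MJ-ST in this repo, so either nest your lmdb accordingly or pass --select_data "/") and the default label filtering on --batch_max_length and --character, which can be turned off with --data_filtering_off.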