PICK-pytorch
Preparing tsv file for custom dataset
Hi, first of all, thanks for the model; it worked very well on my custom dataset. I have two questions about preparing the TSV data for training.
- When three words belong to one entity, do all three words have to be annotated separately in the TSV file, or do they have to be combined into one row?
For example, this is the data:

In the shipping address column, Kothuri Sai Kiran is a name. My OCR model returns these three words separately as Kothuri, Sai and Kiran. So while preparing the TSV file, can I annotate them as three different rows like this,
18,1009,490,1198,490,1198,553,1009,553,Kothuri,name
19,1206,495,1501,495,1501,552,1206,552,Sai,name
20,1619,501,1707,501,1707,560,1619,560,Kiran,name
or do all three words have to be combined like this,
18,1009,490,1707,501,1707,560,1009,553,Kothuri Sai Kiran,name
- In the billing address column, the same name, Kothuri Sai Kiran, appears again. Is it possible to tag this occurrence with the same entity "name"? In a nutshell, can I have multiple OCR text regions tagged with one entity in a single image file?
Looking forward to your response.
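For reference, here is a minimal Python sketch of how the combined row could be built automatically from the three word-level rows above. It assumes each row has the layout `index,x1,y1,x2,y2,x3,y3,x4,y4,text,label` as in the examples, and it uses an axis-aligned box that encloses all the words, so the coordinates come out slightly different from the hand-combined row above. `parse_row` and `merge_rows` are hypothetical helper names, not part of PICK-pytorch, and this does not settle which of the two formats the model expects.

```python
def parse_row(line):
    """Split one annotation row into (index, coords, text, label)."""
    parts = [p.strip() for p in line.split(",")]
    index = parts[0]
    coords = [int(v) for v in parts[1:9]]   # x1,y1,...,x4,y4
    label = parts[-1]
    text = ",".join(parts[9:-1])            # tolerate commas inside the text
    return index, coords, text, label


def merge_rows(lines):
    """Merge word-level rows (given in reading order) into one row.

    The merged box is the axis-aligned rectangle enclosing every word box.
    """
    rows = [parse_row(line) for line in lines]
    labels = {label for _, _, _, label in rows}
    if len(labels) != 1:
        raise ValueError("all rows must share the same entity label")

    xs = [c for _, coords, _, _ in rows for c in coords[0::2]]
    ys = [c for _, coords, _, _ in rows for c in coords[1::2]]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)

    # Corner order: top-left, top-right, bottom-right, bottom-left.
    box = [x_min, y_min, x_max, y_min, x_max, y_max, x_min, y_max]
    text = " ".join(t for _, _, t, _ in rows)
    index = rows[0][0]                      # reuse the first row's index
    return ",".join([index, *map(str, box), text, labels.pop()])


word_rows = [
    "18,1009,490,1198,490,1198,553,1009,553,Kothuri,name",
    "19,1206,495,1501,495,1501,552,1206,552,Sai,name",
    "20,1619,501,1707,501,1707,560,1619,560,Kiran,name",
]
merged = merge_rows(word_rows)
print(merged)
# 18,1009,490,1707,490,1707,560,1009,560,Kothuri Sai Kiran,name
```

After merging, the remaining rows would need to be renumbered if the index column has to stay consecutive.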
@prabhakar-sivanesan: Is it detecting all the entities in your custom dataset? How many data samples did you pass to the model to get good results?
@ninjakx I was training for only 5 entities and used about 70 samples with a 70/30 split. I was able to get good results with that.
@prabhakar-sivanesan Hi Prabhakar, could you let me know which annotation tool you used to prepare the custom dataset?