PICK-pytorch

Preparing tsv file for custom dataset

Open prabhakar-sivanesan opened this issue 4 years ago • 3 comments

Hi, firstly thanks for the model; it worked very well on my custom dataset. But I have two questions about preparing the tsv data for training.

  1. When I have 3 words associated with one entity, do all three words have to be annotated separately in the tsv file, or do they have to be combined into one?

For example, this is the data:

[image: sample]

In the shipping address column, Kothuri Sai Kiran is a name. My OCR model gives these 3 words separately as Kothuri, Sai and Kiran. So while preparing the tsv file, can I annotate them as 3 different rows like this,

18,1009,490,1198,490,1198,553,1009,553,Kothuri,name
19,1206,495,1501,495,1501,552,1206,552,Sai,name
20,1619,501,1707,501,1707,560,1619,560,Kiran,name

or do all three words have to be combined like this,

18,1009,490,1707,501,1707,560,1009,553,Kothuri Sai Kiran,name
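In case the combined form is the one that is needed, here is a minimal sketch of how the three word-level rows could be merged into a single row. It assumes the row layout used in the examples above (index, four corner points in the order top-left, top-right, bottom-right, bottom-left, then text, then entity). The helper names `parse_row` and `merge_rows` are hypothetical and are not part of PICK-pytorch; whether the model actually expects merged rows is the open question here.

```python
def parse_row(row):
    """Split one annotation row into (index, coords, text, entity).

    Assumed layout: index, x1,y1,x2,y2,x3,y3,x4,y4, text, entity.
    The text itself may contain commas, so it is everything between
    the eight coordinates and the final entity field.
    """
    parts = [p.strip() for p in row.split(",")]
    index = int(parts[0])
    coords = [int(v) for v in parts[1:9]]
    entity = parts[-1]
    text = ",".join(parts[9:-1])
    return index, coords, text, entity


def merge_rows(rows):
    """Merge word-level rows on the same text line into one row.

    Takes the left edge (top-left, bottom-left) from the leftmost word
    and the right edge (top-right, bottom-right) from the rightmost word.
    Keeps the index and entity of the leftmost word.
    """
    parsed = sorted((parse_row(r) for r in rows), key=lambda p: p[1][0])
    first, last = parsed[0], parsed[-1]
    coords = first[1][0:2] + last[1][2:6] + first[1][6:8]
    text = " ".join(p[2] for p in parsed)
    fields = [str(first[0])] + [str(c) for c in coords] + [text, first[3]]
    return ",".join(fields)


word_rows = [
    "18,1009,490,1198,490,1198,553,1009,553,Kothuri,name",
    "19,1206,495,1501,495,1501,552,1206,552,Sai,name",
    "20,1619,501,1707,501,1707,560,1619,560,Kiran,name",
]
print(merge_rows(word_rows))
```

For the three rows above this gives the same coordinates as the combined row in the question. It only makes sense for words that sit on one text line; words spread over several lines would need a different way of building the box.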

  2. If you look at the billing address column, I have the same name, Kothuri Sai Kiran. Is it possible to tag this name with the same entity "name"? In a nutshell, can I have multiple OCR text boxes tagged with one entity in a single image file?

Looking forward to your response.

prabhakar-sivanesan, Dec 28 '20 15:12

@prabhakar-sivanesan: Is it detecting all the entities in your custom dataset? How many data samples did you pass to the model to get good results?

ninjakx, Jan 06 '21 05:01

@ninjakx I was training for only 5 entities and I used about 70 samples with a 70/30 split. I was able to get good results with that.
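For anyone wanting to reproduce a 70/30 split on a similar number of samples, something like the following would do it. This is only a sketch: the `invoice_000`-style sample names and the `split_samples` helper are made up for illustration, and it is not necessarily how the split was done here.

```python
import random


def split_samples(names, train_fraction=0.7, seed=0):
    """Shuffle sample names reproducibly and split them into train/test lists."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample names; replace with the real image/tsv base names.
samples = [f"invoice_{i:03d}" for i in range(70)]
train, test = split_samples(samples)
print(len(train), len(test))
```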

prabhakar-sivanesan, Jan 30 '21 04:01

@prabhakar-sivanesan Hi Prabhakar, would you let me know which annotation tool you used for preparing the custom dataset?

Nivedita-mahato2, Jun 10 '21 12:06