
Creating input json file for SROIE dataset

Open mohammedayub44 opened this issue 4 years ago • 3 comments

Hi,

I'm trying to create input JSON files from images to run some tests on the SROIE dataset. I have downloaded the revised dataset from their website. From what I gather, each image (.jpg) comes with two text files (.txt). Task1train files have the bounding boxes with their text, and task2train files have the annotation text with classes.

Looking at the sample provided here by @4kssoft : https://github.com/4kssoft/CUTIE/blob/master/invoice_data/Faktura1.pdf_0.json

It looks like multiple value_id entries need to be populated correctly when the annotated text has multiple words.

For example: Image X51008099073.jpg (from SROIE train dataset) -

The bounding box file has :

112,240,365,240,365,278,112,278,PROSPER NIAGA
114,281,574,281,574,315,114,315,COMPANY NO : SA0099552-P
112,319,534,319,534,357,112,357,LOT PT 1138 ,PT 33122,
114,357,518,357,518,395,114,395,BANDAR MAHKOTA CHERAS
115,395,539,395,539,433,115,433,43200 CHERAS, SELANGOR
114,431,325,431,325,469,114,469,SITE : 2365
...

Annotations file has:

{
    "company": "PROSPER NIAGA",
    "date": "26/06/18",
    "address": "LOT PT 1138 ,PT 33122, BANDAR MAHKOTA CHERAS 43200 CHERAS, SELANGOR",
    "total": "100.00"
}

The "address" annotation value matches lines 3, 4 and 5 of the bounding box file. The same is true for the "total" annotation (not shown here). Could you share some insight on how to automatically parse and link these into multiple value_id and value_text fields? I'm not sure if I'm missing something. I see the same situation for many other texts in other images of this dataset as well.
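One possible way to do this linking is sketched below. It is only a minimal sketch under stated assumptions, not the CUTIE authors' method: it assumes an annotation value equals the text of one or more consecutive bounding box lines joined by single spaces, and it uses the 0-based line index as a stand-in for value_id. The helper names (`parse_box_line`, `link_annotation`, `build_field`) and the `field_name` key are hypothetical and chosen for illustration.

```python
def parse_box_line(line):
    """Split one SROIE box line into 8 integer coordinates and the text."""
    # The text itself may contain commas, so split at most 8 times.
    parts = line.rstrip("\n").split(",", 8)
    coords = [int(p) for p in parts[:8]]
    return coords, parts[8]


def normalize(text):
    """Collapse whitespace runs so spacing differences do not block a match."""
    return " ".join(text.split())


def link_annotation(box_texts, value):
    """Return indices of consecutive box lines whose joined text equals value.

    Tries every start line and extends the run one line at a time.
    Returns [] when no run of lines matches exactly.
    """
    target = normalize(value)
    for start in range(len(box_texts)):
        joined = ""
        for end in range(start, len(box_texts)):
            joined = normalize(joined + " " + box_texts[end])
            if joined == target:
                return list(range(start, end + 1))
            if not target.startswith(joined):
                break  # this run can no longer match
    return []


def build_field(name, box_texts, value):
    """Assemble a field entry (key names other than value_id/value_text are assumed)."""
    ids = link_annotation(box_texts, value)
    return {
        "field_name": name,  # hypothetical key
        "value_id": ids,     # assumption: 0-based line index used as the id
        "value_text": [box_texts[i] for i in ids],
    }


# Sample lines from X51008099073 as quoted above.
sample = [
    "112,240,365,240,365,278,112,278,PROSPER NIAGA",
    "114,281,574,281,574,315,114,315,COMPANY NO : SA0099552-P",
    "112,319,534,319,534,357,112,357,LOT PT 1138 ,PT 33122,",
    "114,357,518,357,518,395,114,395,BANDAR MAHKOTA CHERAS",
    "115,395,539,395,539,433,115,433,43200 CHERAS, SELANGOR",
    "114,431,325,431,325,469,114,469,SITE : 2365",
]
box_texts = [parse_box_line(line)[1] for line in sample]
address = "LOT PT 1138 ,PT 33122, BANDAR MAHKOTA CHERAS 43200 CHERAS, SELANGOR"

field = build_field("address", box_texts, address)
print(field["value_id"])  # [2, 3, 4], i.e. lines 3, 4 and 5 of the file
```

Exact matching like this only covers values that equal whole lines. A value that appears as part of a longer line, or that differs from the box text in some characters, would need substring or fuzzy matching instead, and the resulting ids would still have to be mapped to whatever id scheme the target JSON expects.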

Thanks !

mohammedayub44, Oct 23 '20 05:10

I am also facing the same issue.

ghost, Nov 28 '20 10:11

We generate the texts and corresponding bounding boxes with Google's OCR API. Each text and their bounding box is manually labelled as one of the 9 different classes: 'DontCare', 'VendorName', 'VendorTaxID', 'InvoiceDate', 'InvoiceNumber', 'ExpenseAmount', 'BaseAmount', 'TaxAmount', and 'TaxRate'

Looks like this is the solution.

monolidth, Dec 22 '20 11:12

@monolidth, which annotation tool did you use?

ghost, Dec 22 '20 14:12