ViBERTgrid-PyTorch

I need help customizing the entities of the SROIE dataset

Open kerberosargos opened this issue 1 year ago • 6 comments

Hello, and thank you in advance for your support.

I would like to extend the SROIE entity types using my own dataset. Is that possible? For example, I would like to change the following list

SROIE_CLASS_LIST = ["others", "company", "date", "address", "total"]

to something like

SROIE_CLASS_LIST = ["others", "company", "date", "time", "address", "total", "tax", "sub_total"]

kerberosargos avatar May 30 '24 06:05 kerberosargos

Yes, it is possible. The main modification lies in the number of categories and the corresponding mappings. Change SROIE_CLASS_LIST, TAG_TO_IDX, and TAG_TO_IDX_BIO in train_SROIE.py and eval_SROIE.py to match your custom entity types, then change the num_classes term in the config YAML file. You may also need to modify the postprocessing rules in eval_SROIE.py accordingly.

ZeningLin avatar May 30 '24 07:05 ZeningLin

Thank you very much for your very fast answer. However, I did not understand how to modify the B- and I- tags. Could you modify them for me, following my extended example? The current definitions are:

SROIE_CLASS_LIST = ["others", "company", "date", "address", "total"]

TAG_TO_IDX = {
    "O": 0,
    "B-company": 1,
    "B-date": 2,
    "B-address": 3,
    "B-total": 4,
}

TAG_TO_IDX_BIO = {
    "O": 0,
    "B-company": 1,
    "I-company": 2,
    "B-date": 3,
    "I-date": 4,
    "B-address": 5,
    "I-address": 6,
    "B-total": 7,
    "I-total": 8,
}

kerberosargos avatar May 30 '24 07:05 kerberosargos

And one more question.

Do I have to provide entity files like the following to train on SROIE entities?

{
    "company": "BOOK TA .K (TAMAN DAYA) SDN BHD",
    "date": "25/12/2018",
    "address": "NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.",
    "total": "9.00"
}

**Or can I use only the box and transcript file, without the entities?**

1,83,41,331,41,331,78,83,78,TAN WOON YANN,other
1,109,171,330,171,330,191,109,191,MR D.I.Y. (M) SDN BHD,company
1,122,190,325,190,325,213,122,213,(CO. RFG : 860671-D),other
1,47,208,391,208,391,233,47,233,LOT 1851-A & 1851-B, JALAN KPB 6,,address
1,62,235,381,235,381,254,62,254,KAWASAN PERINDUSTRIAN BALAKONG,,address
1,70,256,384,256,384,275,70,275,43300 SERI KEMBANGAN, SELANGOR,address
1,125,275,318,275,318,297,125,297,(TESCO PUTRA NILAI),other
1,177,295,266,295,266,317,177,317,-INVOICE-,other
1,12,337,402,337,402,362,12,362,KILAT AUTO ECO WASH & SHINE ES1000 1L,other
1,20,360,160,360,160,383,20,383,WA45 /2A - 12,other
1,16,382,156,382,156,402,16,402,9555916500133,other

kerberosargos avatar May 30 '24 07:05 kerberosargos

For example, if your entity types are [others, type1, type2, type3], the corresponding IDX maps should be:

TAG_TO_IDX = {
    "O": 0,    # Remember to keep the background type (others, or O tag) as the first term
    "B-type1": 1,
    "B-type2": 2,
    "B-type3": 3,
}

TAG_TO_IDX_BIO = {
    "O": 0,   # Remember to keep the background type (others, or O tag) as the first term
    "B-type1": 1,
    "I-type1": 2,
    "B-type2": 3,
    "I-type2": 4,
    "B-type3": 5,
    "I-type3": 6,
}

You may also use the following code to generate the corresponding mappings:

SROIE_CLASS_LIST = ["others", "company", "date", "time", "address", "total", "tax", "sub_total"]

TAG_TO_IDX_ = ["O"]
TAG_TO_IDX_BIO_ = ["O"]
for cls_type in SROIE_CLASS_LIST[1:]:
    TAG_TO_IDX_.append(f"B-{cls_type}")
    TAG_TO_IDX_BIO_.append(f"B-{cls_type}")
    TAG_TO_IDX_BIO_.append(f"I-{cls_type}")

TAG_TO_IDX = {s: i for i, s in enumerate(TAG_TO_IDX_)}
TAG_TO_IDX_BIO = {s: i for i, s in enumerate(TAG_TO_IDX_BIO_)}
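
As a quick sanity check, the same logic can be wrapped in a small function and the result inspected for the extended class list. This is only a sketch: `build_tag_maps` and `IDX_TO_TAG_BIO` are hypothetical names used for illustration and do not come from the repository.

```python
def build_tag_maps(class_list):
    """Build the B-only and BIO tag-to-index maps.

    class_list[0] is assumed to be the background class ("others"),
    which always maps to the "O" tag at index 0.
    """
    tags = ["O"]
    bio_tags = ["O"]
    for cls_type in class_list[1:]:
        tags.append(f"B-{cls_type}")
        bio_tags.extend([f"B-{cls_type}", f"I-{cls_type}"])
    tag_to_idx = {tag: i for i, tag in enumerate(tags)}
    tag_to_idx_bio = {tag: i for i, tag in enumerate(bio_tags)}
    return tag_to_idx, tag_to_idx_bio


SROIE_CLASS_LIST = ["others", "company", "date", "time", "address", "total", "tax", "sub_total"]
TAG_TO_IDX, TAG_TO_IDX_BIO = build_tag_maps(SROIE_CLASS_LIST)

# Hypothetical inverse map, handy for turning predicted indices back into tags.
IDX_TO_TAG_BIO = {i: tag for tag, i in TAG_TO_IDX_BIO.items()}

print(len(TAG_TO_IDX), len(TAG_TO_IDX_BIO))  # 8 15
print(TAG_TO_IDX_BIO["B-time"], TAG_TO_IDX_BIO["I-time"])  # 5 6
```

With seven entity types plus the background class, the B-only map has 8 entries and the BIO map has 1 + 2 × 7 = 15.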

ZeningLin avatar May 30 '24 07:05 ZeningLin

For the training phase, only the latter is required, i.e. the box-and-transcript file in which each text line ends with its class label. The code parses these annotations directly and generates the corresponding BIO tags.
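
To illustrate the idea (this is not the repository's actual parsing code), a labelled annotation line could be turned into BIO tags roughly as follows. The function names are hypothetical, words are split on whitespace rather than with the model's tokenizer, and the B- tag is restarted on every text line; all of these are assumptions made for the sketch.

```python
def parse_annotation_line(line):
    """Split 'idx,x1,y1,x2,y2,x3,y3,x4,y4,text,label' into its parts.

    The transcript may itself contain commas, so only the first nine
    commas and the last comma are treated as field separators.
    """
    head, label = line.rstrip("\n").rsplit(",", 1)
    fields = head.split(",", 9)
    index = int(fields[0])
    coords = [int(v) for v in fields[1:9]]
    text = fields[9]
    return index, coords, text, label.strip()


def bio_tags_for_segment(text, label, background=("other", "others")):
    """Tag the first word of a segment with B-<label> and the rest with I-<label>."""
    words = text.split()
    if label in background:
        return [(word, "O") for word in words]
    return [
        (word, ("B-" if i == 0 else "I-") + label)
        for i, word in enumerate(words)
    ]


line = "1,109,171,330,171,330,191,109,191,MR D.I.Y. (M) SDN BHD,company"
_, _, text, label = parse_annotation_line(line)
for word, tag in bio_tags_for_segment(text, label):
    print(word, tag)
```

Splitting from both ends keeps transcripts such as `LOT 1851-A & 1851-B, JALAN KPB 6,` intact even though they contain commas.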

ZeningLin avatar May 30 '24 07:05 ZeningLin

I will try that. Thank you very much for your support and effort. Have a nice day.

kerberosargos avatar May 30 '24 07:05 kerberosargos