ViBERTgrid-PyTorch: Configure sroie_data_preprocessing.py to expand CLASS_LIST

Hello again. I need help expanding CLASS_LIST. Thank you in advance for your support.

I have configured SROIE_CLASS_LIST = ["others", "company", "address", "document_number", "date_time", "total", "tax"]

A sample box file is as follows:

182,70,435,70,435,110,182,110,BURGER KING,company
97,112,512,112,512,155,97,155,EKUR İNŞAAT SANAYİ VE TİCARET A.Ş.,other
42,152,570,152,570,200,42,200,MEVLANA MH.Ç.MEHMET CD. NO:33/A MARMARAPARK,address
70,194,544,194,544,242,70,242,AVM. 3F02 ESENLER/İST. TİC. SİC. NO:300241,address
95,238,522,242,522,291,94,287,BOĞAZİÇİ KURUMLAR V.D.330 005 3911,other
44,312,177,312,177,360,44,360,13/05/2024,date_time
390,315,570,315,570,362,390,362,FİŞ NO: 000132,document_number
47,360,192,360,192,407,47,407,SAAT: 21:17,date_time
60,435,265,435,265,482,60,482,1 2TB+K.IC+K.PAT,other
307,440,350,440,350,482,307,482,%10,other
447,437,542,437,542,482,447,482,*119,99,other
87,482,242,482,242,527,87,527,2 TAVUKBRGER,other
472,485,540,485,540,527,472,527,*0,00,other
307,487,350,487,350,527,307,527,%10,other
115,530,347,530,347,575,115,575,1 + Peynir Ekle %10,other
460,530,537,530,537,572,460,572,*10,00,other
470,575,535,575,535,617,470,617,*0,00,other
142,577,347,577,347,620,142,620,+ DomatesEkle %10,other
120,618,278,623,277,667,118,662,1 + TursuEkle,other
305,620,347,620,347,662,305,662,%10,other
467,620,532,620,532,660,467,660,*0,00,other
467,662,530,662,530,702,467,702,*0,00,other
120,665,347,665,347,707,120,707,1 + Sogan Ekle %10,other
97,705,245,705,245,747,97,747,1 KUCUKAYRAN,other
465,705,530,705,530,745,465,745,*0,00,other
305,707,347,707,347,745,305,745,%10,other
465,745,527,745,527,787,465,787,*9,00,other
97,747,232,747,232,790,97,790,1 O.PATATES,other
305,750,345,750,345,787,305,787,%10,other
462,787,527,787,527,827,462,827,*0,00,other
100,790,200,790,200,832,100,832,1 KETCAP,other
305,790,345,790,345,827,305,827,%10,other
462,827,525,827,525,867,462,867,*0,00,other
100,832,210,832,210,872,100,872,1 MAYONEZ,other
305,832,345,832,345,870,305,870,%10,other
102,870,245,870,245,912,102,912,1 ISTENMIYOR,other
460,870,522,870,522,910,460,910,*0,00,other
305,872,345,872,345,910,305,910,%10,other
460,910,522,910,522,947,460,947,*0,00,other
102,912,245,912,245,952,102,952,1 ISTENMIYOR,other
302,912,345,912,345,950,302,950,%10,other
447,987,520,987,520,1027,447,1027,*12,64,tax
72,990,210,990,210,1030,72,1030,TOPKDV,other
435,1025,517,1025,517,1065,435,1065,*138,99,total
75,1027,209,1027,209,1070,75,1070,TOPLAM,other
432,1102,517,1102,517,1142,432,1142,*138,99,other
75,1107,137,1107,137,1145,75,1145,NAKİT,other
75,1145,290,1145,290,1182,75,1182,POS:3 RefNum:30122,other
119,1204,487,1200,487,1242,120,1246,Sipariş Numarası:,other
250,1242,342,1242,342,1282,250,1282,3122,other
74,1307,234,1304,235,1346,75,1349,Kasiyer: 25620,other
74,1352,537,1347,537,1379,75,1385,*********************************************************************************************************************************,other
77,1387,504,1385,505,1427,77,1430,Asagidaki web sitesinde anket doldurun.,other
75,1427,375,1427,375,1467,75,1467,King boy secim bedava alin.,other
77,1470,367,1470,367,1507,77,1507,www.burgerkingdeneyimi.com,other
77,1508,334,1505,335,1547,77,1550,Sifre: 2182851100391240,other
77,1550,245,1550,245,1590,77,1590,Doğrulama Kodu:,other
77,1589,477,1589,477,1629,77,1629,Sifre ve dogrulama kodu alindigindan,other
79,1631,504,1626,505,1668,80,1673,itibaren 15 gun icinde kullanilmalidir,other
77,1672,477,1668,477,1709,77,1712,Sartlar ve icerik web sayfasindadir.,other
79,1717,544,1709,545,1739,80,1747,*********************************************************************************************************************************,other
80,1770,272,1770,272,1807,80,1807,KASİYER:KASİYER 2,other
417,1820,542,1820,542,1852,417,1852,EKÜ NO:0003,other
84,1826,209,1823,210,1857,85,1860,Z NO:001532,other
209,1879,389,1879,389,1904,209,1904,NF 3E 20040058,other
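
Each line in the sample above carries eight coordinates, then the text, then a trailing label, and the text itself can contain commas (for example *119,99). In the processed result shown later in this thread, the text column still ends in suffixes such as ",company" and ",other", which suggests the label is being read as part of the text. Whether the repository's loader expects this extra label column is not confirmed here. The following is only a minimal sketch of separating the three parts; parse_box_line is a hypothetical helper, not a function from ViBERTgrid-PyTorch.

```python
def parse_box_line(line: str):
    """Split one box-file line into (coords, text, label).

    Assumes the layout shown in the sample: eight integer coordinates,
    then the text (which may itself contain commas), then a label.
    """
    # The first eight commas delimit the coordinates; the rest stays intact.
    parts = line.rstrip("\n").split(",", 8)
    coords = [int(value) for value in parts[:8]]
    # The last comma separates the label from the text.
    text, label = parts[8].rsplit(",", 1)
    return coords, text, label


coords, text, label = parse_box_line(
    "447,437,542,437,542,482,447,482,*119,99,other"
)
print(coords, repr(text), label)
```

If the preprocessing script expects the original SROIE layout (coordinates followed by text only), the label could be dropped at this point and the line rewritten without it.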

A sample key file is as follows:

{
    "company": "BURGER KING",
    "address": "MEVLANA MH.Ç.MEHMET CD. NO:33/A MARMARAPARK AVM. 3F02 ESENLER/İST. TİC. SİC. NO:300241",
    "document_number": "FİŞ NO: 000132",
    "date_time": "13/05/2024 SAAT: 21:17",
    "total": "*138,99",
    "tax": "*12,64"
}

Given the data above, how should I modify the following code?

I also do not want to use regex to match fixed data patterns. I would like to match on the raw text only.

    total_float = re.search(r"([-+]?[0-9]*\.?[0-9]+)", key_info["total"])
    for index, row in gt_dataframe.iterrows():
        # default value
        gt_dataframe.loc[index, "pos_neg"] = 2

        # retrieve 'company' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[0].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 1
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'address' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[2].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 3
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'date' in gt_dataframe
        tab_date = re.findall(
            date_regex,
            row["text"],
        )
        for date in tab_date:
            if date[0] == key_info["date"]:
                gt_dataframe.loc[index, "data_class"] = 2
                gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'total' in gt_dataframe
        tab_floats = re.findall(r"([-+]?[0-9]*\.?[0-9]+)", row["text"])
        if total_float:
            for float_ in tab_floats:
                if float(total_float.group(0)) == float(float_):
                    gt_dataframe.loc[index, "data_class"] = 4
                    gt_dataframe.loc[index, "pos_neg"] = 1

    return gt_dataframe, image_shape

Jun 03 '24 10:06 kerberosargos

Hello again. My changed parameters for the expanded SROIE_CLASS_LIST are as follows:

TAG_TO_IDX: {'O': 0, 'B-company': 1, 'B-address': 2, 'B-document_number': 3, 'B-date_time': 4, 'B-total': 5, 'B-tax': 6}

TAG_TO_IDX_BIO: {'O': 0, 'B-company': 1, 'I-company': 2, 'B-address': 3, 'I-address': 4, 'B-document_number': 5, 'I-document_number': 6, 'B-date_time': 7, 'I-date_time': 8, 'B-total': 9, 'I-total': 10, 'B-tax': 11, 'I-tax': 12}
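
As a consistency check, both maps can be derived from the class list instead of being typed by hand. This is a minimal sketch that reproduces the two dictionaries above; build_tag_maps is a hypothetical helper, and whether the repository builds these maps automatically or needs them edited manually is not stated in this thread.

```python
SROIE_CLASS_LIST = ["others", "company", "address", "document_number",
                    "date_time", "total", "tax"]


def build_tag_maps(class_list):
    """Derive the plain and BIO tag maps; index 0 is the background 'O' tag."""
    tag_to_idx = {"O": 0}
    tag_to_idx_bio = {"O": 0}
    # Skip the first entry ("others"), which maps to the 'O' tag.
    for i, name in enumerate(class_list[1:]):
        tag_to_idx[f"B-{name}"] = i + 1
        tag_to_idx_bio[f"B-{name}"] = 2 * i + 1
        tag_to_idx_bio[f"I-{name}"] = 2 * i + 2
    return tag_to_idx, tag_to_idx_bio


TAG_TO_IDX, TAG_TO_IDX_BIO = build_tag_maps(SROIE_CLASS_LIST)
print(TAG_TO_IDX)
print(TAG_TO_IDX_BIO)
```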

After that, I changed the processing code as follows:

    # total_float = re.search(r"([-+]?[0-9]*\.?[0-9]+)", key_info["total"])
    for index, row in gt_dataframe.iterrows():
        # default value
        gt_dataframe.loc[index, "pos_neg"] = 2

        # retrieve 'company' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[0].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 1
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'address' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[1].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 2
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'document_number' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[2].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 3
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'date_time' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[3].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 4
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'total' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[4].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 5
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'tax' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[5].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 6
            gt_dataframe.loc[index, "pos_neg"] = 1



        # # retrieve 'date' in gt_dataframe
        # tab_date = re.findall(
        #     date_regex,
        #     row["text"],
        # )
        # for date in tab_date:
        #     if date[0] == key_info["date"]:
        #         gt_dataframe.loc[index, "data_class"] = 2
        #         gt_dataframe.loc[index, "pos_neg"] = 1

        # # retrieve 'total' in gt_dataframe
        # tab_floats = re.findall(r"([-+]?[0-9]*\.?[0-9]+)", row["text"])
        # if total_float:
        #     for float_ in tab_floats:
        #         if float(total_float.group(0)) == float(float_):
        #             gt_dataframe.loc[index, "data_class"] = 4
        #             gt_dataframe.loc[index, "pos_neg"] = 1

    return gt_dataframe, image_shape

My result, created in the train_processed\ocr_result directory, is as follows:

,left,top,right,bot,text,data_class,pos_neg
0,182,70,261,110,BURGER,2,1
1,275,70,434,110,"KING,company",2,1
2,97,112,138,155,EKUR,0,2
3,148,112,210,155,İNŞAAT,0,2
4,220,112,282,155,SANAYİ,0,2
5,292,112,312,155,VE,0,2
6,323,112,395,155,TİCARET,0,2
7,406,112,509,155,"A.Ş.,other",0,2
8,42,152,114,200,MEVLANA,0,2
9,124,152,237,200,MH.Ç.MEHMET,0,2
10,248,152,279,200,CD.,0,2
11,289,152,361,200,NO:33/A,4,1
12,371,152,567,200,"MARMARAPARK,address",0,2
13,70,194,107,242,AVM.,0,2
14,117,194,154,242,3F02,0,2
15,164,194,277,242,ESENLER/İST.,0,2
16,287,194,324,242,TİC.,0,2
17,334,194,371,242,SİC.,0,2
18,381,194,542,242,"NO:300241,address",3,1
19,95,238,180,291,BOĞAZİÇİ,0,2
20,191,238,276,291,KURUMLAR,0,2
21,287,238,361,291,V.D.330,0,2
22,372,238,404,291,005,0,2
23,414,238,520,291,"3911,other",0,2
24,44,312,177,360,"13/05/2024,date_time",5,1
25,390,315,408,362,FİŞ,4,1
26,414,315,432,362,NO:,4,1
27,438,315,570,362,"000132,document_number",4,1
28,47,360,81,407,SAAT:,5,1
29,88,360,191,407,"21:17,date_time",5,1
30,60,435,69,482,1,0,2
31,78,435,264,482,"2TB+K.IC+K.PAT,other",0,2
32,307,440,350,482,"%10,other",0,2
33,447,437,542,482,"*119,99,other",6,1
34,87,482,95,527,2,0,2
35,104,482,241,527,"TAVUKBRGER,other",0,2
36,472,485,540,527,"*0,00,other",0,2
37,307,487,350,527,"%10,other",0,2
38,115,530,124,575,1,0,2
39,133,530,142,575,+,0,2
40,151,530,206,575,Peynir,0,2
41,215,530,252,575,Ekle,0,2
42,261,530,344,575,"%10,other",0,2
43,460,530,537,572,"*10,00,other",0,2
44,470,575,535,617,"*0,00,other",0,2
45,142,577,150,620,+,0,2
46,159,577,257,620,DomatesEkle,0,2
47,265,577,345,620,"%10,other",0,2
48,120,618,128,667,1,0,2
49,136,618,144,667,+,0,2
50,152,618,275,667,"TursuEkle,other",0,2
51,305,620,347,662,"%10,other",0,2
52,467,620,532,660,"*0,00,other",0,2
53,467,662,530,702,"*0,00,other",0,2
54,120,665,129,707,1,0,2
55,138,665,147,707,+,0,2
56,156,665,203,707,Sogan,0,2
57,212,665,249,707,Ekle,0,2
58,259,665,344,707,"%10,other",0,2
59,97,705,105,747,1,0,2
60,113,705,244,747,"KUCUKAYRAN,other",0,2
61,465,705,530,745,"*0,00,other",0,2
62,305,707,347,745,"%10,other",0,2
63,465,745,527,787,"*9,00,other",0,2
64,97,747,104,790,1,0,2
65,112,747,231,790,"O.PATATES,other",0,2
66,305,750,345,787,"%10,other",0,2
67,462,787,527,827,"*0,00,other",0,2
68,100,790,107,832,1,0,2
69,114,790,199,832,"KETCAP,other",0,2
70,305,790,345,827,"%10,other",0,2
71,462,827,525,867,"*0,00,other",0,2
72,100,832,107,872,1,0,2
73,114,832,209,872,"MAYONEZ,other",0,2
74,305,832,345,870,"%10,other",0,2
75,102,870,109,912,1,0,2
76,117,870,244,912,"ISTENMIYOR,other",0,2
77,460,870,522,910,"*0,00,other",0,2
78,305,872,345,910,"%10,other",0,2
79,460,910,522,947,"*0,00,other",0,2
80,102,912,109,952,1,0,2
81,117,912,244,952,"ISTENMIYOR,other",0,2
82,302,912,345,950,"%10,other",0,2
83,447,987,520,1027,"*12,64,tax",0,2
84,72,990,210,1030,"TOPKDV,other",0,2
85,435,1025,517,1065,"*138,99,total",6,1
86,75,1027,209,1070,"TOPLAM,other",0,2
87,432,1102,517,1142,"*138,99,other",6,1
88,75,1107,137,1145,"NAKİT,other",0,2
89,75,1145,119,1182,POS:3,0,2
90,128,1145,289,1182,"RefNum:30122,other",0,2
91,119,1204,231,1242,Sipariş,0,2
92,247,1204,487,1242,"Numarası:,other",0,2
93,250,1242,342,1282,"3122,other",0,2
94,74,1307,138,1346,Kasiyer:,0,2
95,146,1307,234,1346,"25620,other",0,2
96,74,1352,537,1379,"*********************************************************************************************************************************,other",0,2
97,77,1387,162,1427,Asagidaki,0,2
98,172,1387,200,1427,web,0,2
99,210,1387,295,1427,sitesinde,0,2
100,305,1387,352,1427,anket,0,2
101,362,1387,504,1427,"doldurun.,other",0,2
102,75,1427,111,1467,King,2,1
103,120,1427,147,1467,boy,0,2
104,156,1427,201,1467,secim,0,2
105,210,1427,264,1467,bedava,0,2
106,273,1427,373,1467,"alin.,other",0,2
107,77,1470,367,1507,"www.burgerkingdeneyimi.com,other",0,2
108,77,1508,130,1547,Sifre:,0,2
109,139,1508,334,1547,"2182851100391240,other",0,2
110,77,1550,149,1590,Doğrulama,0,2
111,157,1550,245,1590,"Kodu:,other",0,2
112,77,1589,124,1629,Sifre,0,2
113,134,1589,153,1629,ve,0,2
114,162,1589,247,1629,dogrulama,0,2
115,257,1589,295,1629,kodu,0,2
116,304,1589,475,1629,"alindigindan,other",0,2
117,79,1631,156,1668,itibaren,0,2
118,166,1631,185,1668,15,0,2
119,195,1631,224,1668,gun,0,2
120,233,1631,291,1668,icinde,0,2
121,300,1631,503,1668,"kullanilmalidir,other",0,2
122,77,1672,143,1709,Sartlar,0,2
123,153,1672,172,1709,ve,0,2
124,181,1672,238,1709,icerik,0,2
125,247,1672,275,1709,web,0,2
126,285,1672,475,1709,"sayfasindadir.,other",0,2
127,79,1717,545,1739,"*********************************************************************************************************************************,other",0,2
128,80,1770,205,1807,KASİYER:KASİYER,0,2
129,213,1770,271,1807,"2,other",0,2
130,417,1820,439,1852,EKÜ,0,2
131,446,1820,541,1852,"NO:0003,other",0,2
132,84,1826,91,1857,Z,0,2
133,98,1826,209,1857,"NO:001532,other",0,2
134,209,1879,227,1904,NF,0,2
135,236,1879,254,1904,3E,0,2
136,263,1879,389,1904,"20040058,other",0,2

I think something is not right. Am I wrong?

Thank you in advance

Jun 03 '24 12:06 kerberosargos

Sorry for my delayed response. Are you currently working on a custom dataset or simply expanding the category types of SROIE?

Jun 04 '24 02:06 ZeningLin

If you are making modifications to the SROIE dataset, one approach could be to retrieve the OCR content of the key fields by utilizing string similarity. By doing so, you may obtain multiple results. To determine the desired result, you can rely on the coordinates. For instance, fields such as "tax" might have a string adjacent to it that contains the keyword "TAX".

It is worth noting that the accuracy of the matched labels can significantly impact the final performance. If it is feasible, I highly recommend considering manual labeling of the OCR results for better performance.
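
The coordinate idea above can be sketched as follows. In the sample receipt, the string *138,99 appears twice: once on the row containing TOPLAM and once on the row containing NAKİT. This minimal sketch keeps only the candidate that shares a row with a box containing a chosen keyword. The helper names, the 0.5 overlap ratio, and the choice of TOPLAM as the keyword for the total field are assumptions for illustration, not part of the repository.

```python
def same_row(a, b, min_overlap=0.5):
    """True when two boxes overlap vertically by at least min_overlap of the shorter box."""
    overlap = min(a["bot"], b["bot"]) - max(a["top"], b["top"])
    shorter = min(a["bot"] - a["top"], b["bot"] - b["top"])
    return shorter > 0 and overlap / shorter >= min_overlap


def pick_by_keyword(rows, value, keyword):
    """Among rows whose text equals value, keep those on the same row as a keyword box."""
    candidates = [r for r in rows if r["text"] == value]
    anchors = [r for r in rows if keyword in r["text"]]
    return [c for c in candidates if any(same_row(c, a) for a in anchors)]


# Boxes taken from the sample receipt in this thread (labels removed).
rows = [
    {"left": 435, "top": 1025, "right": 517, "bot": 1065, "text": "*138,99"},
    {"left": 75, "top": 1027, "right": 209, "bot": 1070, "text": "TOPLAM"},
    {"left": 432, "top": 1102, "right": 517, "bot": 1142, "text": "*138,99"},
    {"left": 75, "top": 1107, "right": 137, "bot": 1145, "text": "NAKİT"},
]

matches = pick_by_keyword(rows, "*138,99", "TOPLAM")
print(matches)
```

The same function could be called with the tax value and a keyword such as TOPKDV, which sits on the tax row in the sample.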

Jun 04 '24 02:06 ZeningLin

Thank you for your interest. I am working on my own dataset, not the original SROIE dataset.

Jun 04 '24 05:06 kerberosargos

Actually, I have built my own dataset on SROIE's dataset structure.

I mean I have an image, a box txt file, and a JSON key txt file. Everything is fine on my side: the bbox coordinates and the OCR text are correct.

Now I am trying to convert my custom SROIE-structured dataset for your model using the sroie_data_preprocessing.py file.

According to this introduction:

How should I modify def ground_truth_extraction( in pipeline/sroie_data_preprocessing.py for my expanded SROIE_CLASS_LIST?
And must I set the same function's split_word parameter to True for a better result?

Jun 04 '24 05:06 kerberosargos

def ground_truth_extraction provides rules for matching the key fields in SROIE (company, date, address, total). For your custom dataset, you may directly employ the similarity matching strategy in my code to retrieve company, address, document_number, and date_time (your modified code for these categories is correct). For fields like total and tax, the best solution is to use a regular expression.
Based on my experience, a larger granularity (line-level or paragraph-level) may lead to better results, but it varies across datasets. You may try both line-level and word-level annotations to find the better one.
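
A minimal sketch of the regex route for total and tax, assuming amounts use a decimal comma as in the sample (*138,99). The pattern used earlier in this thread, ([-+]?[0-9]*\.?[0-9]+), only recognises a dot as the decimal separator, so on *138,99 it matches 138 and drops the fraction. The names AMOUNT_RE, parse_amount and amounts_equal are hypothetical, and thousands separators are not handled.

```python
import math
import re

# Pattern from the original code: dot is the only decimal separator.
OLD_RE = re.compile(r"([-+]?[0-9]*\.?[0-9]+)")
# Assumed replacement: accept either '.' or ',' as the decimal separator.
AMOUNT_RE = re.compile(r"[-+]?\d+(?:[.,]\d+)?")


def parse_amount(text):
    """Return the first number in text as a float, or None if there is none."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    return float(match.group(0).replace(",", "."))


def amounts_equal(key_value, ocr_text):
    """Compare the amount in the key file with the amount in an OCR box."""
    a, b = parse_amount(key_value), parse_amount(ocr_text)
    return a is not None and b is not None and math.isclose(a, b)


print(OLD_RE.search("*138,99").group(0))   # the old pattern stops at the comma
print(parse_amount("*138,99"))
print(amounts_equal("*138,99", "*138,99"), amounts_equal("*138,99", "*119,99"))
```

Because the total value can appear on more than one line of a receipt, this comparison alone may still return several boxes, so it would likely need to be combined with a positional check.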

Jun 04 '24 07:06 ZeningLin

Thank you again. I do not want to use regex for matching. Could that be a problem for getting the best result? And is the following code modification correct?

SROIE_CLASS_LIST = ["others", "company", "address", "document_number", "date_time", "total", "tax"]

TAG_TO_IDX: {'O': 0, 'B-company': 1, 'B-address': 2, 'B-document_number': 3, 'B-date_time': 4, 'B-total': 5, 'B-tax': 6}

TAG_TO_IDX_BIO: {'O': 0, 'B-company': 1, 'I-company': 2, 'B-address': 3, 'I-address': 4, 'B-document_number': 5, 'I-document_number': 6, 'B-date_time': 7, 'I-date_time': 8, 'B-total': 9, 'I-total': 10, 'B-tax': 11, 'I-tax': 12}

In def ground_truth_extraction,

    for index, row in gt_dataframe.iterrows():
        # default value
        gt_dataframe.loc[index, "pos_neg"] = 2

        # retrieve 'company' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[0].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 1
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'address' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[1].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 2
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'document_number' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[2].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 3
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'date_time' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[3].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 4
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'total' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[4].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 5
            gt_dataframe.loc[index, "pos_neg"] = 1

        # retrieve 'tax' in gt_dataframe
        if (
            cosine_simularity(
                count_vectorizer[5].reshape(1, -1),
                count_vectorizer[index + len(data_classes)].reshape(1, -1),
            )
            > cosine_sim_treshold
        ):
            gt_dataframe.loc[index, "data_class"] = 6
            gt_dataframe.loc[index, "pos_neg"] = 1

    return gt_dataframe, image_shape

Jun 04 '24 07:06 kerberosargos

I think your code can handle the case well. You may set a different cosine_sim_treshold for each category to obtain the optimal result.
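
A minimal sketch of per-category thresholds, written as a loop over the class list instead of six copied blocks. It is self-contained and uses a simple whitespace bag-of-words with a hand-written cosine similarity as a stand-in for the repository's count_vectorizer and cosine_simularity, so the scores will not match the real pipeline. The THRESHOLDS values are placeholders to tune, and label_rows is a hypothetical helper.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words counts over lowercase whitespace tokens (stand-in vectorizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Placeholder thresholds, one per category (assumed values, to be tuned).
THRESHOLDS = {"company": 0.4, "address": 0.2, "document_number": 0.4,
              "date_time": 0.4, "total": 0.9, "tax": 0.9}


def label_rows(texts, key_info, class_list, thresholds):
    """Return (data_class, pos_neg) per text, mirroring the loop in this thread."""
    labels = []
    for text in texts:
        data_class, pos_neg = 0, 2  # same defaults as the original loop
        for class_idx, name in enumerate(class_list[1:], start=1):
            if name not in key_info:
                continue
            if cosine(bow(key_info[name]), bow(text)) > thresholds[name]:
                # As with the chained ifs, a later class overwrites an earlier one.
                data_class, pos_neg = class_idx, 1
        labels.append((data_class, pos_neg))
    return labels


class_list = ["others", "company", "address", "document_number",
              "date_time", "total", "tax"]
key_info = {"company": "BURGER KING", "total": "*138,99", "tax": "*12,64"}
labels = label_rows(["BURGER KING", "*138,99", "TOPLAM"],
                    key_info, class_list, THRESHOLDS)
print(labels)
```

Keeping the class index tied to the position in the class list avoids the risk of a block being copied with the wrong count_vectorizer index or data_class value.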

Jun 04 '24 08:06 ZeningLin

Thank you so much again, for your great project and support. I will try.

Jun 04 '24 09:06 kerberosargos
