
text2image created boxes get error - Bad box coordinates in boxfile string!

Open Shreeshrii opened this issue 8 years ago • 13 comments

Page 2
Error in pixaGetCount: pixa not defined
Error in pixaGetCount: pixa not defined
Loaded 41/41 pages (0-41) of document /tmp/tmp.tY7p2Ue5TC/san/san.Baloo.exp0.lstmf
Bad box coordinates in boxfile string! विताः । नानाशस्त्रप्रहरणाः सर्वे 1576 3968 2121 4022 0
Bad box coordinates in boxfile string! रिदेवना ॥ २-२८॥ आश्चर्य 1526 2958 1995 3016 1
Bad box coordinates in boxfile string! ति ॥ २-६४॥ प्रसादे सर्व 1341 4637 1759 4693 2
Bad box coordinates in boxfile string! ति पूरुषः ॥ ३-१९॥ कर्म 1063 2386 1484 2451 2
Bad box coordinates in boxfile string! विभागयोः । गुणा गुणेषु वर्त 420 1710 909 1776 2
Bad box coordinates in boxfile string! न्थिनौ ॥ ३-३४॥ श्रेयान्स्वधर्मो 1447 1278 1982 1335 2
Bad box coordinates in boxfile string! विनाशाय च दुष्कृताम् । धर्म 1364 4402 1863 4475 3
Bad box coordinates in boxfile string! द्धिमान्मनुष्येषु स युक्तः कृत्स्नकर्म 1206 3622 1812 3694 3

Shreeshrii avatar Dec 09 '16 16:12 Shreeshrii

This was on box files created by text2image.

Shreeshrii avatar Jan 13 '17 02:01 Shreeshrii

https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00#error-messages-from-training

Try running text2image with this option: --output_word_boxes

amitdo avatar Jan 13 '17 10:01 amitdo

Try running text2image with this option: --output_word_boxes

That creates a unicharset with words, and I get additional errors about Utf8 buffer too big.

Shreeshrii avatar Jan 16 '17 14:01 Shreeshrii

Detected 15 diacritics
Loaded 2742/2742 pages (1-2742) of document /tmp/tmp.aA4DsVmpNZ/hin/hin.CDAC-GISTSurekh.exp0.lstmf
Page 77
Page 97
Bad box coordinates in boxfile string! दि ['ए\\^', 25 सर्व 778 1653 1230 1732 92
Page 82
Loaded 3159/3159 pages (1-3159) of document /tmp/tmp.aA4DsVmpNZ/hin/hin.FreeSans.exp0.lstmf
Page 80

Still getting the errors. The box file is generated by text2image.

Shreeshrii avatar Mar 25 '17 14:03 Shreeshrii

See https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

Bad box coordinates in boxfile string! 

The LSTM trainer only needs bounding box information for a complete textline,
instead of at a character level, but if you put spaces in the box string, like this:

<text for line including spaces> <left> <bottom> <right> <top> <page>
the parser will be confused and give you the error message.

The text2image program may need to be fixed if too many errors of this kind are reported.
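The failing lines in the logs all have the shape described in the quoted wiki text: several space-separated words followed by five integers. A minimal Python sketch, assuming the last five fields are always left, bottom, right, top and page, shows how such lines could be flagged in a box file before training. The names `BoxLine`, `parse_box_line` and `has_spaces_in_text` are hypothetical and not part of Tesseract.

```python
from typing import NamedTuple, Optional


class BoxLine(NamedTuple):
    text: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Optional[BoxLine]:
    """Split a box line from the right so that spaces inside the text survive."""
    parts = line.rstrip("\r\n").rsplit(" ", 5)
    if len(parts) != 6:
        return None  # fewer than five trailing fields
    text, *numbers = parts
    try:
        left, bottom, right, top, page = (int(n) for n in numbers)
    except ValueError:
        return None  # one of the trailing fields is not an integer
    return BoxLine(text, left, bottom, right, top, page)


def has_spaces_in_text(line: str) -> bool:
    """True when the text part of a box line contains more than one word."""
    box = parse_box_line(line)
    return box is not None and " " in box.text.strip()
```

Running `has_spaces_in_text` over every line of a `.box` file would list the entries that are likely to trigger the message, without starting a training run.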

Shreeshrii avatar Mar 26 '17 10:03 Shreeshrii

Still getting a large number of these errors. The box file was created by text2image.


Loaded 17903/17903 pages (1-17903) of document /tmp/tmp.A6VtVWpCVT/san/san.Sanskrit_Text.exp0.lstmf
Loaded 14102/14102 pages (1-14102) of document /tmp/tmp.A6VtVWpCVT/san/san.Santipur_OT_Medium.exp0.lstmf
Bad box coordinates in boxfile string! ति झ. पाठः॥ 67-5 क्षयं 478 567 1002 625 83
Bad box coordinates in boxfile string! विचाराचार अध्याय श्रवस्यः योऽयं 117 3530 800 3593 259
Bad box coordinates in boxfile string! न्धिमहोदयस्य च जनवरी-मार्च 426 1260 1016 1332 345
Bad box coordinates in boxfile string! ण्टिर्गौ न्प्रा स्फो  त्फ द्भ्या अस्योद्यानं 190 2392 908 2450 434
Bad box coordinates in boxfile string! ष्पित॥ यस्य १२ मया। अर्क 418 456 969 511 463
No block overlapping textline: ओदनभोजिकाभिः स्कौनगरिकस्य गुरुपत्नीं षट्तन्त्रीसारे।

Shreeshrii avatar Oct 22 '17 12:10 Shreeshrii

@theraysmith

I looked at the more than 2000 textlines that get this error in the Devanagari training I am running right now.

The common thread is that all of these textlines begin with a word that starts with the i matraa (ि), which is the only combining mark in Devanagari that is rendered before the consonant it applies to.

The same was the case in the first post here, as well as in https://github.com/tesseract-ocr/tesseract/issues/555
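This observation can be checked mechanically against the strings in the logs. The sketch below is a heuristic, not Tesseract's own logic: it assumes the leading consonant cluster consists only of characters in U+0915 to U+0939 or U+0958 to U+095F plus virama and nukta, and it tests whether the first sign after that cluster is the i matraa (U+093F, DEVANAGARI VOWEL SIGN I). The function names are hypothetical.

```python
# Heuristic only: the consonant ranges below are a simplifying assumption.
VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA
NUKTA = "\u093C"    # DEVANAGARI SIGN NUKTA
I_MATRA = "\u093F"  # DEVANAGARI VOWEL SIGN I


def is_consonant(ch: str) -> bool:
    """True for code points in the assumed Devanagari consonant ranges."""
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def first_cluster_has_i_matra(text: str) -> bool:
    """True if the first character that is not a consonant, virama or nukta
    is the i matraa."""
    for ch in text:
        if is_consonant(ch) or ch in (VIRAMA, NUKTA):
            continue  # still inside the leading consonant cluster
        return ch == I_MATRA
    return False
```

Applied to the text part of each reported string, a high hit rate would support the pattern described above; strings that return `False` would be worth a closer look.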

Shreeshrii avatar Oct 22 '17 12:10 Shreeshrii

Hi. Rather late, but I thought I would add a possible cause. I had the error "Bad box coordinates in boxfile string! ä╗?─┐√£°.3" and fixed it by converting the EOL from CR LF to LF in the training_text file.
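For anyone who wants to try the same fix without an editor, here is a small Python sketch of the line-ending conversion. It works on raw bytes so the text encoding is untouched; the function name `crlf_to_lf` is hypothetical, and the file is rewritten in place, so keeping a backup is advisable.

```python
from pathlib import Path


def crlf_to_lf(path: str) -> int:
    """Rewrite the file with LF line endings.

    Returns the number of CR LF pairs that were replaced.
    """
    p = Path(path)
    data = p.read_bytes()
    count = data.count(b"\r\n")
    if count:
        p.write_bytes(data.replace(b"\r\n", b"\n"))
    return count
```

A return value of 0 means the training text already used LF endings, in which case this is not the cause of the error.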

sharkbait-au avatar Feb 17 '18 11:02 sharkbait-au

Was it solved with best/fast?

amitdo avatar Mar 04 '18 10:03 amitdo

Problem seems to be fixed with newer code.

Using the failing lines from the first post and the same font, 'Baloo', I get only two error lines.

Bad box coordinates in boxfile string! न्धिमहोदयस्य च जनवरी-मार्च 110 4533 656 4589 0
Bad box coordinates in boxfile string! ष्पित॥ यस्य १२ मया। अर्क 109 4316 584 4365 0

Other fonts do not give the error.

Shreeshrii avatar Mar 04 '18 11:03 Shreeshrii

Still getting the error with the same training text, but only with certain fonts.

Bad box coordinates in boxfile string! पि यत् न एनम् यत् फलं 147 3936 604 4008 38
Extracting unicharset from plain text file /tmp/tmp.afQhzt87iE/san/san.GIST-DVOTKishor.exp0.box
Bad box coordinates in boxfile string! पि यत् न एनम् यत् फलं 128 3091 590 3156 36
Extracting unicharset from plain text file /tmp/tmp.afQhzt87iE/san/san.GIST-DVOTMohini.exp0.box
etc

The complete line in training text is:

अपि यत् न एनम् यत् फलं गोदावरी नदी कथं तत्र स्म तं क्रि.श तमे गृहेषु च यज्ञो दानं

text2image is creating a box with part of the line (multiple words instead of one) for some fonts.

 grep 'पि यत् ' *.*
Binary file san.AA_NAGARI_SHREE_L3.exp0.lstmf matches
Binary file san.Adobe_Devanagari.exp0.lstmf matches
Binary file san.Aksharyogini.exp0.lstmf matches
Binary file san.Aksharyogini_Italic.exp0.lstmf matches
Binary file san.Arial_Unicode_MS.exp0.lstmf matches
san.CDAC-GISTSurekh.exp0.box:पि यत् न एनम् यत् फलं 141 1479 562 1545 34
Binary file san.CDAC-GISTSurekh.exp0.lstmf matches
san.CDAC-GISTYogesh.exp0.box:पि यत् न एनम् यत् फलं 141 3635 599 3701 37
Binary file san.CDAC-GISTYogesh.exp0.lstmf matches
san.CDAC-GISTYogesh_Italic.exp0.box:पि यत् न एनम् यत् फलं 130 3073 625 3144 36
Binary file san.CDAC-GISTYogesh_Italic.exp0.lstmf matches
Binary file san.Ek_Mukta.exp0.lstmf matches
Binary file san.FreeSerif.exp0.lstmf matches
Binary file san.Gargi.exp0.lstmf matches
san.GIST-DVOTKishor.exp0.box:पि यत् न एनम् यत् फलं 147 3936 604 4008 38
Binary file san.GIST-DVOTKishor.exp0.lstmf matches
san.GIST-DVOTMohini.exp0.box:पि यत् न एनम् यत् फलं 128 3091 590 3156 36
Binary file san.GIST-DVOTMohini.exp0.lstmf matches
san.GIST-MROTDhruv.exp0.box:पि यत् न एनम् यत् फलं 137 2362 616 2430 35
Binary file san.GIST-MROTDhruv.exp0.lstmf matches
san.GIST-MROTVinit.exp0.box:पि यत् न एनम् यत् फलं 140 2361 574 2433 35
Binary file san.GIST-MROTVinit.exp0.lstmf matches
Binary file san.GIST-SDOTDhruv.exp0.lstmf matches
san.GIST-SDOTVinit.exp0.box:पि यत् न एनम् यत् फलं 140 3637 617 3699 37
Binary file san.GIST-SDOTVinit.exp0.lstmf matches
Binary file san.Gotu.exp0.lstmf matches
Binary file san.Jaipur_Unicode_NFLC.exp0.lstmf matches
Binary file san.JanaHindi.exp0.lstmf matches
Binary file san.JanaMarathi.exp0.lstmf matches
Binary file san.JanaSanskrit.exp0.lstmf matches
Binary file san.Kokila_Bold_Italic.exp0.lstmf matches
Binary file san.Kokila.exp0.lstmf matches
Binary file san.Kokila_Italic.exp0.lstmf matches
Binary file san.Lohit_Devanagari.exp0.lstmf matches
Binary file san.Lohit_Marathi.exp0.lstmf matches
Binary file san.Mangal.exp0.lstmf matches
Binary file san.Martel.exp0.lstmf matches
Binary file san.Murty_Hindi.exp0.lstmf matches
Binary file san.Murty_Sanskrit.exp0.lstmf matches
Binary file san.Nakula.exp0.lstmf matches
Binary file san.Pragati_Narrow.exp0.lstmf matches
Binary file san.Ranga_Italic.exp0.lstmf matches
Binary file san.Sahadeva.exp0.lstmf matches
Binary file san.Sahitya.exp0.lstmf matches
Binary file san.Samyak_Devanagari_Medium.exp0.lstmf matches
Binary file san.Sanskrit_2003.exp0.lstmf matches
Binary file san.Sarai.exp0.lstmf matches
Binary file san.Shobhika_Bold.exp0.lstmf matches
Binary file san.Shobhika.exp0.lstmf matches
Binary file san.SHREE-DV0726-OT.exp0.lstmf matches
Binary file san.Siddhanta-cakravat.exp0.lstmf matches
Binary file san.Siddhanta-Calcutta.exp0.lstmf matches
Binary file san.Siddhanta.exp0.lstmf matches
Binary file san.Sumana.exp0.lstmf matches
Binary file san.Sura.exp0.lstmf matches
Binary file san.Utsaah_Bold_Italic.exp0.lstmf matches
Binary file san.Utsaah.exp0.lstmf matches
Binary file san.Utsaah_Italic.exp0.lstmf matches
Binary file san.Uttara.exp0.lstmf matches
Binary file san.Vesper_Libre_Medium.exp0.lstmf matches
san.Yashomudra_Bold_Italic.exp0.box:पि यत् न एनम् यत् फलं 118 3868 605 3945 40
Binary file san.Yashomudra_Bold_Italic.exp0.lstmf matches
san.Yashomudra.exp0.box:पि यत् न एनम् यत् फलं 112 3869 598 3944 40
Binary file san.Yashomudra.exp0.lstmf matches
san.Yashomudra_Italic.exp0.box:पि यत् न एनम् यत् फलं 119 3870 604 3944 40
Binary file san.Yashomudra_Italic.exp0.lstmf matches
san.YashomudraLight_Italic.exp0.box:पि यत् न एनम् यत् फलं 120 3872 604 3945 40
Binary file san.YashomudraLight_Italic.exp0.lstmf matches
san.YashomudraMedium_Italic.exp0.box:पि यत् न एनम् यत् फलं 119 3869 604 3944 40
Binary file san.YashomudraMedium_Italic.exp0.lstmf matches
san.YashomudraSemiBold_Bold_Italic.exp0.box:पि यत् न एनम् यत् फलं 118 3868 606 3945 40
Binary file san.YashomudraSemiBold_Bold_Italic.exp0.lstmf matches
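To see where such a partial box starts within the training text, the box string can be searched for in each training line. The following is a minimal sketch under the assumption that the box text and the training text use the same Unicode normalization; `locate_in_training_text` is a hypothetical helper, not a Tesseract tool.

```python
from typing import Iterable, Optional, Tuple


def locate_in_training_text(
    box_text: str, lines: Iterable[str]
) -> Optional[Tuple[int, int]]:
    """Return (line_number, offset) for the first training-text line that
    contains box_text, or None if no line contains it.

    line_number is 1-based; offset is counted in code points from the
    start of that line.
    """
    for number, line in enumerate(lines, start=1):
        offset = line.find(box_text)
        if offset != -1:
            return number, offset
    return None
```

An offset greater than 0 means the box begins partway into the line, as with the example above where the leading अ of the training line is missing from the box string.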

Shreeshrii avatar Apr 17 '18 11:04 Shreeshrii

Still getting the error

Loaded 2520/2520 pages (1-2520) of document /tmp/tmp.86Rxa2s1ug/san/san.Gotu.exp0.lstmf
Bad box coordinates in boxfile string! ण्दिग्धया मुह्यमाना वक्र समवर्त 641 3699 1326 3774 2
Bad box coordinates in boxfile string! मि। अभीष्टदायै नमः - सर्वा 377 243 968 313 7
Bad box coordinates in boxfile string! चित्तता घभस्तय अमन्यत अर्घ्या 646 2802 1326 2858 8
Bad box coordinates in boxfile string! भिद्रुता अज्ञायमाना तान्यल्प अर्थे 576 249 1278 328 8
Bad box coordinates in boxfile string! न्धिमहोदयस्य च जनवरी-मार्च 1107 3142 1764 3211 14
Bad box coordinates in boxfile string! ल्पितवान् । . नॄन् । अर्बु 965 927 1442 1011 25
Bad box coordinates in boxfile string! र्षिणा प्रपतन् वासस पुरीसानां 1075 1973 1710 2051 29
Bad box coordinates in boxfile string! विष्टैः त्वरन्न् जागाद दानः मूर्ते 924 3351 1543 3430 36
Bad box coordinates in boxfile string! किरत् ६अ अनु ॥ एकवर्जं 1122 934 1658 1014 44
Bad box coordinates in boxfile string! र्किकाश्छान्दसाश्चैव च महता उक्तं 644 4406 1339 4474 51
Bad box coordinates in boxfile string! ति लेख्यौ प्रथमवृत्ते तयोः पौर्वा 437 4410 1070 4494 63
Bad box coordinates in boxfile string! र्दिते अपेक्षः सकुण्डलैः सुतपसं 496 259 1133 334 65
Bad box coordinates in boxfile string! धिना मामुद्धतु| । ॥६ ॥ राजा ” पटौ ४० ॥ मूर्ध 130 3214 1063 3304 70
Bad box coordinates in boxfile string! ति यदा शुक्र उवाच = अर्जु 337 4514 874 4587 72
Bad box coordinates in boxfile string! णि च उवाच। सँस्रव' वर्द्ध 1096 495 1604 563 76
Bad box coordinates in boxfile string! र्किकाश्छान्दसाश्चैव च महता उक्तं 147 2111 842 2180 86
Bad box coordinates in boxfile string! ति । आसीत् । शत्रुघ्न स्वर्ग 1406 4624 1948 4704 91
Bad box coordinates in boxfile string! रिवारवान् प्रध्मापयत् ययावर्जु 905 4039 1539 4118 100
Bad box coordinates in boxfile string! मिथुनम् कारयेत् | ये पर्यु 509 1964 1040 2042 103

Shreeshrii avatar Jun 10 '18 13:06 Shreeshrii

I got a similar error, and it was fixed only by changing the box coordinates. You probably did not take into account that in Tesseract the y-coordinate is counted from the bottom of the image.
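If the boxes come from a tool that measures y from the top of the image, the conversion looks roughly like the Python sketch below. It assumes the input is a top-left corner plus width and height in pixels, and that the image height is known; `top_left_to_box` is a hypothetical name.

```python
from typing import Tuple


def top_left_to_box(
    x: int, y: int, width: int, height: int, image_height: int
) -> Tuple[int, int, int, int]:
    """Convert a top-left-origin rectangle to (left, bottom, right, top)
    with the origin at the bottom-left of the image, as used in box files."""
    left = x
    right = x + width
    top = image_height - y                # upper edge, measured from the bottom
    bottom = image_height - (y + height)  # lower edge, measured from the bottom
    return left, bottom, right, top
```

After conversion, `bottom` should be smaller than `top` for every box; if it is not, the coordinates are probably still in the top-left convention.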

MikhailesU avatar Sep 09 '23 09:09 MikhailesU