
Ground truth: spaces before and after text?

Open jbarth-ubhd opened this issue 2 years ago • 10 comments

I've created *.exp0.gt.txt files as a base for manual ground truth creation using Shreeshrii's shell script, and each file contains a space before and after the text (no newlines etc.). Example:

01127778-001.exp0.gt.txt: " Verfahrenstechnik / -70 B 2813 "
01127778-002.exp0.gt.txt: " Verfahrenstechnik, Forschung und Lehre, (Zsgest. Ue "
01127778-003.exp0.gt.txt: " verf. von Kurt Schiefer und Kurt Boekmer,) "
01127778-004.exp0.gt.txt: " Düsseldorf: Verfahrenstechnische Ges, im Verein Deut- ! "
01127778-005.exp0.gt.txt: " scher Ingenieure 1967. 185 S, 8° "
01127778-006.exp0.gt.txt: " Frühere Ausg. 3. u.d.T.: Verfahrenstechnik im In- und "
01127778-007.exp0.gt.txt: " Ausland. "

... but The-Hallucination-Effect states »Example 2: Your training text frequently includes a Space at the beginning of your sentences or at the end. Might result in slow training, non-convergence & even model corruption.«

My Question: Spaces or not?

The single-line images are cropped very tightly, with no blank space before or after the text.

jbarth-ubhd, Feb 16 '23 09:02

OK, the *.gt.txt files in ocrd-testset.zip contain no spaces before or after the text, but they do contain \n.

jbarth-ubhd, Feb 16 '23 10:02

> ok, ocrd-testset.zip *.gt.txt contain no spaces before/after, but \n

@jbarth-ubhd, I see neither a space nor a newline at the end of the *.gt.txt files.

vishakraj25, Feb 24 '23 06:02

I do see newlines (4th line below):

jb@xxx:~/Downloads/ocrd-testset> cat *.gt.txt|od -c|head
0000000   i   c   h       d   e   n   k   e   .       A   b   e   r    
0000020   w   a   s       d   i   e     305 277   e   l   i   g   e    
0000040   F   r   a   u       G   e   h   e   i   m   r 303 244   t   h
0000060   i   n  \n 342 200 236   D   a   s       k   a   n   n       i
0000100   c   h       n   i   c   h   t   ,       c   '   e   s   t    

jbarth-ubhd, Feb 24 '23 08:02
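The same check can be done per file instead of reading an `od -c` dump. The following is only a minimal sketch: the helper name `describe_edges` is hypothetical (it is not part of tesstrain), and it assumes the ground truth files are UTF-8 encoded and live in the current directory.

```python
from pathlib import Path


def describe_edges(raw: bytes) -> dict:
    """Report edge whitespace for the raw bytes of one ground truth file.

    Hypothetical helper for illustration; assumes UTF-8 content.
    """
    text = raw.decode("utf-8")
    # Ignore one final linefeed so that it is not counted as trailing space.
    body = text[:-1] if text.endswith("\n") else text
    return {
        "leading_space": body.startswith((" ", "\t")),
        "trailing_space": body.endswith((" ", "\t")),
        "final_linefeed": text.endswith("\n"),
    }


if __name__ == "__main__":
    # Print one report per ground truth file in the current directory.
    for path in sorted(Path(".").glob("*.gt.txt")):
        print(path.name, describe_edges(path.read_bytes()))
```

Reading bytes rather than text keeps the platform's newline translation out of the picture, so a final `\n` is reported exactly as stored.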

Ground truth line text must not have spaces before or after the text. It may end with a linefeed (which gets added automatically by many editors).

stweil, Mar 01 '23 05:03
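Existing files that violate this rule can be cleaned up in place. The sketch below is one possible way to do it, not a tesstrain tool: `normalize_gt_line` and `normalize_gt_files` are hypothetical names, the files are assumed to be UTF-8, and ending every file with exactly one linefeed is a choice made here (the rule above only says a linefeed is allowed).

```python
from pathlib import Path


def normalize_gt_line(text: str) -> str:
    """Strip whitespace around the line text and end it with one linefeed."""
    # str.strip() removes all leading/trailing Unicode whitespace,
    # including spaces, tabs and existing newlines.
    return text.strip() + "\n"


def normalize_gt_files(directory: str, pattern: str = "*.gt.txt") -> int:
    """Rewrite matching files in place; return how many were changed."""
    changed = 0
    for path in sorted(Path(directory).glob(pattern)):
        original = path.read_bytes().decode("utf-8")
        fixed = normalize_gt_line(original)
        if fixed != original:
            # Write bytes so no platform newline translation is applied.
            path.write_bytes(fixed.encode("utf-8"))
            changed += 1
    return changed
```

For example, `normalize_gt_files(".", "*.exp0.gt.txt")` would process the files named in this issue. Running it on a copy of the data first is advisable, since it overwrites the files.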

Just tried it again with https://github.com/tesseract-ocr/tesstrain/issues/7 and https://github.com/ocropus/hocr-tools/blob/master/hocr-extract-images , the generated .exp0.gt.txt files contain spaces before & after:

308-119.exp0.gt.txt: " | == zz NN NN ANNE NZZ SE anli : "
308-120.exp0.gt.txt: " über 1 BONS DD DD SS EN U = NS utfer]pras "
308-121.exp0.gt.txt: " Datei ihrem unfeligen zZ — SS AN . LEE KA 5 XS Ode bot. "
308-122.exp0.gt.txt: " N ein SH 7 DD SS 7 ea san Zn EFF LEE Z—_-- BEN UNE x "

jbarth-ubhd, Mar 01 '23 08:03

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], May 22 '23 01:05

bump

jbarth-ubhd, Jun 28 '23 11:06

> Just tried it again with #7 and https://github.com/ocropus/hocr-tools/blob/master/hocr-extract-images , the generated .exp0.gt.txt files contain spaces before & after

Then I assume that the original data (hOCR) already contains such spaces. Do you have a link to an example?

stweil, Jun 28 '23 12:06

The .hocr does not contain spaces: <span ...>abcdefg</span>, but the .exp0.gt.txt does.

See https://digi.ub.uni-heidelberg.de/diglitData/v/tesstrain-issue-335.zip for a complete test environment; main script is Shreeshrii-script.

jbarth-ubhd, Jun 28 '23 13:06

The spaces before and after the line occur if your hOCR file is indented.

hocr-extract-images uses a regex to replace each run of one or more whitespace characters with a single space; see line 20.

I am not sure whether they had indentation in mind, though.

jbollacke, Nov 09 '23 06:11
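The effect described in this comment can be reproduced in isolation. This is an illustrative sketch, not the actual code of hocr-extract-images: the function name `collapse_whitespace` and the sample string are assumptions, standing in for the text content of a line element whose child spans are indented on separate lines.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace each run of whitespace with one space (illustrative only)."""
    return re.sub(r"\s+", " ", text)


# Assumed text content of an indented hOCR line element: the newline and
# indentation before and after the word belong to the element's text.
indented_line_text = "\n      Ausland.\n    "

collapsed = collapse_whitespace(indented_line_text)
print(repr(collapsed))          # the indentation survives as one space on each side
print(repr(collapsed.strip()))  # stripping afterwards removes those edge spaces
```

Because the substitution turns the leading and trailing indentation into single spaces rather than removing it, an additional `strip()` on the result (or on the generated files) would be needed to get ground truth without edge spaces.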