
Fix for TIKA-2581 contributed by ewanmellor.

Open ewanmellor opened this issue 7 years ago • 3 comments

TesseractOCRParserTest.testOCROutputsHOCR fails with Tesseract 4.0.

With 3.x, the output is <span>Happy</span>, but with 4.0 the output is <span><strong>Happy</strong></span>. Both of these seem reasonable to me, so this change updates the test to accept either of them.

ewanmellor avatar Feb 21 '18 20:02 ewanmellor
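For illustration, a minimal sketch of an "accept either form" check. This is not the code from the actual patch: the class name `HocrEitherForm` and the helper `containsHappy` are hypothetical, and the literal strings are the simplified forms quoted in the description above (real hOCR spans may carry attributes).

```java
// Hypothetical sketch, not the actual TIKA-2581 patch.
class HocrEitherForm {

    // Simplified span forms as quoted in the description (assumption:
    // real hOCR output may include attributes on the span).
    static final String TESSERACT_3_FORM = "<span>Happy</span>";
    static final String TESSERACT_4_FORM = "<span><strong>Happy</strong></span>";

    // Returns true if the output contains the word in either form.
    static boolean containsHappy(String xml) {
        return xml.contains(TESSERACT_3_FORM) || xml.contains(TESSERACT_4_FORM);
    }

    public static void main(String[] args) {
        System.out.println(containsHappy("<p><span>Happy</span></p>"));
        System.out.println(containsHappy("<p><span><strong>Happy</strong></span></p>"));
    }
}
```

In a JUnit test, the boolean could be wrapped in `assertTrue(...)` so that output from either Tesseract version passes.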

I wonder if it'd be better to either strip the <strong> tags out before comparing, or just check for Happy</ instead?

Gagravarr avatar Feb 21 '18 21:02 Gagravarr
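The two alternatives suggested above might look roughly like the following. This is only a sketch: `HocrAlternatives`, `matchesAfterStripping`, and `matchesLoosely` are hypothetical names, and the span strings are the simplified forms from the description rather than real hOCR output.

```java
// Hypothetical sketch of the two suggested alternatives.
class HocrAlternatives {

    // Option 1: strip <strong> and </strong> before comparing, so that
    // the 4.0 output collapses to the 3.x form.
    static boolean matchesAfterStripping(String xml) {
        String stripped = xml.replaceAll("</?strong>", "");
        return stripped.contains("<span>Happy</span>");
    }

    // Option 2: check only for the word followed by the start of any
    // closing tag, ignoring which element is being closed.
    static boolean matchesLoosely(String xml) {
        return xml.contains("Happy</");
    }

    public static void main(String[] args) {
        String v3 = "<p><span>Happy</span></p>";
        String v4 = "<p><span><strong>Happy</strong></span></p>";
        System.out.println(matchesAfterStripping(v3) && matchesAfterStripping(v4));
        System.out.println(matchesLoosely(v3) && matchesLoosely(v4));
    }
}
```

Option 2 is the loosest of the three approaches: it would also pass if the word were wrapped in some other element, which may or may not be desirable in a test.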

@Gagravarr Both of those options would work, but I don't see how either of them is any better.

ewanmellor avatar Feb 21 '18 21:02 ewanmellor

At this point, does it make sense to support Tesseract 3 when running tests? Maybe update the documentation at https://cwiki.apache.org/confluence/display/TIKA/TikaOCR to note that the output format is slightly different?

epugh avatar Oct 21 '19 21:10 epugh