tika Fix for TIKA-2581 contributed by ewanmellor.

Open ewanmellor opened this issue 7 years ago • 3 comments

TesseractOCRParserTest.testOCROutputsHOCR fails with Tesseract 4.0.

With 3.x, the output is Happy, but with 4.0 the output is Happy. Both of these seem reasonable to me, so update the test to accept either of them.

Feb 21 '18 20:02 ewanmellor

I wonder if it'd be better to either strip the tags out before comparing, or just check for Happy</ instead?

Feb 21 '18 21:02 Gagravarr
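
For reference, the two alternatives suggested above could look roughly like the sketch below. This is only an illustration: `HocrTextCheck` and its methods are hypothetical names, not part of Tika or its test suite, and the regex is a crude tag matcher that is good enough for a test assertion but is not a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of Tika): sketches the two comparison
// strategies from the comment above so a test need not depend on the
// exact markup a given Tesseract version emits.
final class HocrTextCheck {

    // Matches anything that looks like a tag, e.g. <span ...> or </span>.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    private HocrTextCheck() {
    }

    // Option 1: strip all tags, leaving only the text content.
    static String stripTags(String hocr) {
        return TAG.matcher(hocr).replaceAll("");
    }

    // Option 1, applied: does the word appear once the markup is removed?
    static boolean containsWordIgnoringMarkup(String hocr, String word) {
        return stripTags(hocr).contains(word);
    }

    // Option 2: check that the word is immediately followed by a closing
    // tag, whatever that tag happens to be.
    static boolean containsWordBeforeClosingTag(String hocr, String word) {
        return hocr.contains(word + "</");
    }
}
```

In a test, either `containsWordIgnoringMarkup(xml, "Happy")` or `containsWordBeforeClosingTag(xml, "Happy")` would then replace an exact-markup comparison, where `xml` stands for whatever string the test already captures from the parser.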

@Gagravarr Both of those options would work, but I don't see how either of them is any better.

Feb 21 '18 21:02 ewanmellor

At this point, does it make sense to support Tesseract 3 when running tests? Maybe update the documentation at https://cwiki.apache.org/confluence/display/TIKA/TikaOCR to note that the output format is slightly different?

Oct 21 '19 21:10 epugh
