tika
Fix for TIKA-2581 contributed by ewanmellor.
TesseractOCRParserTest.testOCROutputsHOCR fails with Tesseract 4.0.
With 3.x, the output is <span>Happy</span> but with 4.0 the output is
<span><strong>Happy</strong></span>. Both these seem reasonable to me,
so update the test to accept either of them.
I wonder if it'd be better to either strip the <strong> tags out before comparing, or just check for Happy</ instead?
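For reference, a minimal sketch of the two alternatives suggested here, assuming the two output shapes quoted in the description. The class and method names (HocrNormalizer, stripStrong, containsWord) are hypothetical and are not part of Tika or of the existing test; real hOCR spans also carry attributes, which this sketch ignores.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: illustrates the two comparison
// strategies discussed in this thread.
public class HocrNormalizer {
    // Matches opening and closing <strong> tags, with optional attributes.
    private static final Pattern STRONG_TAG =
            Pattern.compile("</?strong\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Option 1: remove <strong> tags so 3.x and 4.0 output compare equal.
    static String stripStrong(String xml) {
        return STRONG_TAG.matcher(xml).replaceAll("");
    }

    // Option 2: only check that the word is followed by some closing tag.
    static boolean containsWord(String xml, String word) {
        return xml.contains(word + "</");
    }

    public static void main(String[] args) {
        String v3 = "<span>Happy</span>";                  // shape reported for Tesseract 3.x
        String v4 = "<span><strong>Happy</strong></span>"; // shape reported for Tesseract 4.0

        System.out.println(stripStrong(v3).equals(stripStrong(v4)));
        System.out.println(containsWord(v3, "Happy") && containsWord(v4, "Happy"));
    }
}
```

Either check passes for both shapes above. The substring check is the looser of the two, since it would also pass if the word were wrapped in some other tag.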
@Gagravarr Both of those options would work, but I don't see how either of them is any better.
At this point, does it make sense to support Tesseract 3 when running tests? Maybe update the documentation at https://cwiki.apache.org/confluence/display/TIKA/TikaOCR to note that the output format is slightly different?