
hocr line combination

Open MarcGrotheer opened this issue 4 years ago • 4 comments

How can this be an ocr_line?

MarcGrotheer avatar Mar 30 '21 10:03 MarcGrotheer

Forgot to say: this seems to be an issue only on the second page of a document.

MarcGrotheer avatar Mar 30 '21 10:03 MarcGrotheer

It's not really clear what the issue is, but I'd say you have several words in the hOCR export that correspond to a single word in the document.

This might be related to the fact that this document prints several letters on top of each other to simulate bold letters, or it might just be an error.

Try using the following to clean your letters before using them:

var letters = DuplicateOverlappingTextProcessor.Get(page.Letters);
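
For context, here is a minimal sketch of where that call could sit in an extraction pipeline. The file name, the page number, and the choice of `NearestNeighbourWordExtractor` and `DocstrumBoundingBoxes` are assumptions for illustration, not details taken from this issue, so swap in whatever your hOCR export actually uses. It is written with top-level statements (C# 9 or later).

```csharp
using System;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// "document.pdf" and page 2 are placeholders (assumptions, not from the issue).
using (var document = PdfDocument.Open("document.pdf"))
{
    var page = document.GetPage(2);

    // Drop letters drawn on top of each other (e.g. simulated bold text).
    var letters = DuplicateOverlappingTextProcessor.Get(page.Letters);

    // Assumed word extractor and page segmenter; replace with the ones you use.
    var words = NearestNeighbourWordExtractor.Instance.GetWords(letters);
    var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

    // Print each detected line with its bounding box to inspect the grouping.
    foreach (var line in blocks.SelectMany(b => b.TextLines))
    {
        Console.WriteLine($"{line.BoundingBox}: {line.Text}");
    }
}
```

Comparing the printed lines with and without the `DuplicateOverlappingTextProcessor` step should show whether duplicated letters are what causes the unexpected grouping.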

BobLd avatar Apr 21 '21 21:04 BobLd

The 4 span.ocrx_word elements are not on one line but in a column (as per the layout/design), i.e. stacked below each other; see the bbox values. All other pages are correct: there, the words on one line are combined into an ocr_line. On page 2, however, words in a column are (sometimes?) combined into an ocr_line.

MarcGrotheer avatar Jul 26 '21 13:07 MarcGrotheer

Can you provide the pdf document? Also, can you detail how you did the extraction (provide the code)? Without that, it will be extremely difficult to understand your issue and help. Thanks

BobLd avatar Aug 17 '21 14:08 BobLd