PdfPig
hocr line combination
How can this be an ocr_line?
I forgot to say: this seems to be an issue only on the second page of the document.
It's not really clear what the issue is, but I'd say you have several words in the hOCR export that correspond to a single word in the document.
This might be related to the fact that this document prints several letters on top of each other to simulate bold letters, or it might just be an error.
Try using the following to clean your letters before using them:
var letters = DuplicateOverlappingTextProcessor.Get(page.Letters);
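For context, here is a minimal sketch of how that call could sit in a full extraction, so the resulting lines and their bounding boxes can be checked by hand. The file name `document.pdf`, the page number, and the choice of `NearestNeighbourWordExtractor` and `DocstrumBoundingBoxes` are assumptions for illustration only; the thread does not say which word extractor or page segmenter was used for the hOCR export.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

public static class Program
{
    public static void Main()
    {
        // "document.pdf" is a placeholder path (assumption).
        using (var document = PdfDocument.Open("document.pdf"))
        {
            // Pages are 1-indexed; page 2 is the one reported as problematic.
            var page = document.GetPage(2);

            // Drop letters drawn on top of each other (e.g. simulated bold).
            var letters = DuplicateOverlappingTextProcessor.Get(page.Letters);

            // Assumed word extractor and page segmenter; swap in the ones you actually use.
            var words = NearestNeighbourWordExtractor.Instance.GetWords(letters);
            var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

            // Print each line with its bounding box to see which words were grouped together.
            foreach (var block in blocks)
            {
                foreach (var line in block.TextLines)
                {
                    var box = line.BoundingBox;
                    Console.WriteLine($"[{box.Left:F1}, {box.Bottom:F1}, {box.Right:F1}, {box.Top:F1}] {line.Text}");
                }
            }
        }
    }
}
```

If the words of a column still end up in a single line here, the grouping comes from the segmentation step rather than from duplicated letters.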
The 4 span.ocrx_word elements are not on one line but in a column (as per the layout/design), i.e. below each other; see the bbox values. On all other pages the words on one line are combined into an ocr_line, but on page 2 (sometimes?) words in a column are combined into an ocr_line.
Can you provide the PDF document? Also, can you detail how you did the extraction (provide the code)? Without that, it will be extremely difficult to understand your issue and help. Thanks