oldnyc
Merge lines using bounding boxes
I'm currently doing this with the OCR'd text directly, mostly out of expedience. Lines with similar widths are joined.
But it would be better to do this with the bounding boxes from ocropus-gpageseg. For example, in 712393b, the first line of the paragraph is indented. The right edges of the lines in the paragraph are all close to one another, even though the first line has fewer characters.
Vertical gaps between lines could also be used as cues here.
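A rough sketch of how the right-edge and vertical-gap cues could be combined. It assumes the boxes are already available as pixel coordinates (y increasing downward) in reading order. The `Line` class, `group_paragraphs`, and both thresholds are hypothetical and are not taken from the existing code:

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One OCR'd line plus its bounding box (hypothetical structure)."""
    text: str
    x0: int  # left edge
    y0: int  # top edge
    x1: int  # right edge
    y1: int  # bottom edge


def group_paragraphs(lines, edge_tol=0.05, gap_factor=0.6):
    """Join consecutive lines into paragraphs using box geometry.

    A line is appended to the current paragraph when the previous line
    reaches (nearly) the right margin and the vertical gap between the
    two boxes is small relative to the previous line's height.
    Both thresholds are guesses that would need tuning on real pages.
    """
    if not lines:
        return []
    right_margin = max(l.x1 for l in lines)
    width = right_margin - min(l.x0 for l in lines)
    paragraphs = [[lines[0]]]
    for prev, cur in zip(lines, lines[1:]):
        prev_is_full = (right_margin - prev.x1) <= edge_tol * width
        gap_is_small = (cur.y0 - prev.y1) <= gap_factor * (prev.y1 - prev.y0)
        if prev_is_full and gap_is_small:
            paragraphs[-1].append(cur)
        else:
            paragraphs.append([cur])
    return [" ".join(l.text for l in p) for p in paragraphs]
```

Because the test looks at the right edge of the previous line rather than at character counts, an indented first line still counts as full, and a short trailing line is attached to the paragraph it ends.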
While I'm at it, it would also be better to detect "NO REPRODUCTIONS"-style lines on a per-box basis, since these sometimes get merged with dates or attributions.
This would be done in extract_ocropy_text.py.
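A minimal sketch of the per-box "NO REPRODUCTIONS" check, assuming each box's OCR text is available as a separate string. The function names and the 0.8 threshold are hypothetical, and fuzzy matching with `difflib` is just one way to tolerate OCR noise:

```python
import difflib
import re

# Target phrase with spaces removed, since OCR spacing is unreliable.
BOILERPLATE = "NOREPRODUCTIONS"


def is_boilerplate(text, threshold=0.8):
    """Return True if a single box's text looks like 'NO REPRODUCTIONS'."""
    letters = re.sub(r"[^A-Z]", "", text.upper())
    if not letters:
        return False
    ratio = difflib.SequenceMatcher(None, letters, BOILERPLATE).ratio()
    return ratio >= threshold


def drop_boilerplate(box_texts):
    """Filter boilerplate boxes out before any lines are merged."""
    return [t for t in box_texts if not is_boilerplate(t)]
```

Running the check on each box before merging means a date or attribution in a neighboring box is kept even when the stamp sits right next to it.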
722041f is an interesting case here. The small line (east side.) between paragraphs should be joined to the first.