oldnyc

Merge lines using bounding boxes

Open: danvk opened this issue 9 years ago • 1 comment

I'm currently doing this with the OCR'd text directly, mostly out of expedience. Lines with similar widths are joined.

But it would be better to do this with the bounding boxes from ocropus-gpageseg. For example, in 712393b, the first line of the paragraph is indented. The right edges of the lines in the paragraph are all close to one another, even though the first line has fewer characters.

Vertical gaps between lines could also be used as cues here.
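A minimal sketch of how the two cues above (aligned right edges and small vertical gaps) might be combined. It assumes the bounding boxes have already been extracted as pixel coordinates with y increasing downward. The `Line` class, `same_paragraph`, `merge_lines`, and both thresholds are hypothetical and are not taken from extract_ocropy_text.py.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """One OCR'd line with its bounding box (hypothetical structure)."""
    text: str
    x0: int  # left edge
    y0: int  # top edge
    x1: int  # right edge
    y1: int  # bottom edge


def same_paragraph(prev: Line, cur: Line,
                   edge_tol: int = 20, gap_ratio: float = 0.5) -> bool:
    """Guess whether `cur` continues the paragraph that `prev` belongs to.

    Both thresholds are illustrative guesses and would need tuning.
    """
    line_height = prev.y1 - prev.y0
    vertical_gap = cur.y0 - prev.y1
    # Right edges line up even when the first line is indented.
    right_aligned = abs(prev.x1 - cur.x1) <= edge_tol
    # A large vertical gap suggests a paragraph break.
    close_together = vertical_gap <= gap_ratio * line_height
    return right_aligned and close_together


def merge_lines(lines: List[Line]) -> List[str]:
    """Group lines into paragraphs, top to bottom, and join their text."""
    paragraphs: List[List[Line]] = []
    for line in sorted(lines, key=lambda l: l.y0):
        if paragraphs and same_paragraph(paragraphs[-1][-1], line):
            paragraphs[-1].append(line)
        else:
            paragraphs.append([line])
    return [" ".join(l.text for l in p) for p in paragraphs]


example = [
    Line("Indented first", 60, 0, 400, 20),
    Line("second line of para", 20, 25, 405, 45),
    Line("1936", 20, 90, 100, 110),
]
merged = merge_lines(example)
```

Requiring aligned right edges means a short final line of a paragraph would start a new group under this rule, so a real implementation would likely need an extra case for it.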

While I'm at it, it would also be better to detect "NO REPRODUCTIONS"-style lines on a per-box basis, since these sometimes get merged with dates or attributions.
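The per-box check could look roughly like the sketch below. It assumes each box arrives as a `(bbox, text)` pair before any merging. The regular expression and the helper names `is_boilerplate` and `drop_boilerplate_boxes` are hypothetical, and real OCR output may be garbled enough to need fuzzier matching.

```python
import re
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed layout

# Hypothetical pattern; it only covers the "NO REPRODUCTIONS" wording.
NO_REPRO_RE = re.compile(r"\bno\s+reproductions?\b", re.IGNORECASE)


def is_boilerplate(text: str) -> bool:
    """True if a single box's text looks like a NO REPRODUCTIONS notice."""
    return bool(NO_REPRO_RE.search(text))


def drop_boilerplate_boxes(
        boxes: List[Tuple[BBox, str]]) -> List[Tuple[BBox, str]]:
    """Filter out notice boxes before lines are merged, so a notice
    cannot be fused with a neighboring date or attribution."""
    return [(bbox, text) for bbox, text in boxes if not is_boilerplate(text)]


boxes = [
    ((0, 0, 100, 20), "June 4, 1932."),
    ((0, 30, 200, 50), "NO REPRODUCTIONS."),
]
kept = drop_boilerplate_boxes(boxes)
```

Filtering before the merge step keeps the date or attribution text intact, which is the problem the text-level check runs into.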

This would be done in extract_ocropy_text.py.

danvk, Apr 30 '15 14:04

722041f is an interesting case here. The short line ("east side.") between paragraphs should be joined to the first paragraph.

danvk, Apr 30 '15 14:04