Improved merge_texts when dealing with double column documents
Hi, I tried the new merge_line_texts. I think the algorithm assumes that the image is a single column, so it behaves oddly on double-column documents, which are quite common. Here is a rough improved version.
Intuition: when dealing with a double-column document, humans would read like this:
# Handle single-column vs. double-column layouts here.
# The only difference between the two shows up as we traverse the document from left to right, top to bottom:
# in a double-column layout, boxes to the right of the page midline are held in a cache
# until we reach a box that spans the midline; only then is the cache flushed.
only_text = ''
right_list = []  # right-side cache.
mid = pix.width / 2
sorted_list = sorted(outs, key=lambda x: min(point[1] for point in x['position']))
global last_layout_id
last_layout_id = -1
for out in sorted_list:
    # if the box crosses the midline.
    if mid > min(point[0] for point in out['position']) and mid < max(point[0] for point in out['position']):
        # flush the right-side cache first.
        for cached in right_list:
            only_text += layout_sensitive_str(cached, layout)
        right_list = []
        only_text += layout_sensitive_str(out, layout)
    # left column (with 10% slack past the midline).
    elif mid * 1.1 > max(point[0] for point in out['position']):
        only_text += layout_sensitive_str(out, layout)
    else:
        right_list.append(out)
# flush whatever is left in the cache.
for cached in right_list:
    only_text += layout_sensitive_str(cached, layout)
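To make the reading order easier to check in isolation, here is a minimal self-contained sketch of the same midline rule on synthetic boxes. The names `reading_order`, `make_box`, and `left_slack` are made up for this illustration and are not part of Pix2Text; the sketch assumes each box carries a `position` list of four (x, y) corner points, as in the snippet above.

```python
def reading_order(boxes, page_width, left_slack=1.1):
    """Return boxes in two-column reading order using the midline rule."""
    mid = page_width / 2
    ordered, right_cache = [], []
    # Visit boxes from top to bottom, keyed by the topmost y of each box.
    for box in sorted(boxes, key=lambda b: min(y for _, y in b["position"])):
        xs = [x for x, _ in box["position"]]
        if min(xs) < mid < max(xs):
            # The box spans the midline: emit the pending right column first.
            ordered.extend(right_cache)
            right_cache = []
            ordered.append(box)
        elif max(xs) < mid * left_slack:
            # Left-column box: emit immediately.
            ordered.append(box)
        else:
            # Right-column box: hold it until the next spanning box.
            right_cache.append(box)
    ordered.extend(right_cache)
    return ordered


def make_box(text, x0, y0, x1, y1):
    """Hypothetical helper: build a box from its axis-aligned corners."""
    return {"text": text, "position": [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]}


boxes = [
    make_box("Title", 100, 10, 900, 40),
    make_box("L1", 50, 60, 480, 80),
    make_box("R1", 520, 60, 950, 80),
    make_box("L2", 50, 90, 480, 110),
    make_box("R2", 520, 90, 950, 110),
    make_box("Footer", 100, 200, 900, 220),
]
order = [b["text"] for b in reading_order(boxes, page_width=1000)]
print(order)  # ['Title', 'L1', 'L2', 'R1', 'R2', 'Footer']
```

Note that the rule only works when the column gap sits near `page_width / 2`; an off-center crop would send boxes to the wrong side.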
As you might see, I also tried to solve the '\n' problem with a supplementary run of layout analysis.
When multiple lines of text belong to the same layout block, the \n between them is no longer needed. Here is a detailed version (sorry for the messy code):
def intersection_area(bbox1, bbox2):
    """Pseudocode: return the intersection area of bbox1 and bbox2."""

def max_layout_id(layout, bbox_8):
    """Pseudocode: return (max_ratio, layout_id) for the layout block that maximizes the intersection ratio with bbox_8."""
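The two helpers above are only described, so here is one possible way to fill them in. This is a sketch under assumptions, not the code I actually ran: it assumes each layout block stores an axis-aligned rectangle under a `bbox` key as `[x0, y0, x1, y1]` (that key name is made up here), that the text box is given as four (x, y) points, and that the ratio is the overlap divided by the text box area. `to_rect` is a helper introduced for this illustration.

```python
def to_rect(points):
    """Axis-aligned bounding rectangle (x0, y0, x1, y1) of a list of (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def intersection_area(rect1, rect2):
    """Overlap area of two (x0, y0, x1, y1) rectangles; 0 if they are disjoint."""
    w = min(rect1[2], rect2[2]) - max(rect1[0], rect2[0])
    h = min(rect1[3], rect2[3]) - max(rect1[1], rect2[1])
    return max(w, 0) * max(h, 0)


def max_layout_id(layout, bbox_8):
    """Return (best_ratio, best_index) over layout['layout'].

    Returns (0.0, None) when the text box overlaps no layout block,
    so the caller has to handle the None case explicitly.
    """
    rect = to_rect(bbox_8)
    area = (rect[2] - rect[0]) * (rect[3] - rect[1])
    best_ratio, best_id = 0.0, None
    for idx, block in enumerate(layout["layout"]):
        inter = intersection_area(rect, block["bbox"])  # 'bbox' key is an assumption
        ratio = inter / area if area > 0 else 0.0
        if ratio > best_ratio:
            best_ratio, best_id = ratio, idx
    return best_ratio, best_id


# Two side-by-side blocks and a text line that lies inside the right one.
layout = {
    "layout": [
        {"label": "Text", "bbox": [0, 0, 500, 300]},
        {"label": "Text", "bbox": [500, 0, 1000, 300]},
    ]
}
line = [(520, 60), (950, 60), (950, 80), (520, 80)]
print(max_layout_id(layout, line))  # (1.0, 1)
```

Returning `None` rather than -1 for "no match" avoids both silently indexing the last block and colliding with the -1 wildcard used for `last_layout_id`.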
last_layout_id = -1  # layout block id of the previous line.

def layout_sensitive_str(out, layout):
    global last_layout_id
    max_ratio, cur_id = max_layout_id(layout, out['position'])
    if layout['layout'][cur_id]['label'] in ('Table', 'Figure'):
        # Tables and figures are not recognized in my task; contribute no text.
        return ''
    else:
        # Lines in the same layout block get no \n between them.
        # -1 is a wildcard that matches any layout block id. It is -1 at the start of
        # each page and is reset to -1 after a line that ends with a hyphen.
        out_text = out['text'].strip(' \n')
        ret = ''
        if not out_text:
            return ''
        # If this line is in a different block from the previous one, start a new line.
        # Exception: a preceding hyphen set last_layout_id = -1, which matches anything.
        if cur_id != last_layout_id and last_layout_id != -1:
            ret += '\n'
        ret += out_text
        # If the line ends with a hyphen, drop the hyphen.
        if ret[-1] == '-':
            last_layout_id = -1
            ret = ret[:-1]
        # Otherwise, append a space.
        else:
            last_layout_id = cur_id
            ret += ' '
        return ret
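For anyone who wants to try the joining rules without the global state or a layout model, here is a small standalone sketch. `join_lines` is a made-up name, and it assumes the block id of each line has already been determined, so the input is just a list of (text, block_id) pairs.

```python
def join_lines(lines):
    """Join (text, block_id) pairs: newline between blocks, de-hyphenate line ends."""
    parts = []
    last_id = -1  # -1 is the wildcard: it never triggers a newline
    for text, block_id in lines:
        text = text.strip(" \n")
        if not text:
            continue
        # A change of block starts a new line, unless the wildcard is active.
        if block_id != last_id and last_id != -1:
            parts.append("\n")
        if text.endswith("-"):
            # Trailing hyphen: drop it and glue the next line on directly.
            parts.append(text[:-1])
            last_id = -1
        else:
            parts.append(text + " ")
            last_id = block_id
    return "".join(parts)


lines = [
    ("Layout analysis helps", 0),
    ("when merging recog-", 0),
    ("nized lines.", 1),
    ("A new block starts here.", 2),
]
result = join_lines(lines)
print(repr(result))
# 'Layout analysis helps when merging recognized lines. \nA new block starts here. '
```

One caveat of this rule is that a genuine compound broken at its hyphen (for example "double-" followed by "column") also loses the hyphen.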
Outputs:
Original implementation:
Improved:
It would be better if I used a CDLA-based model, which would further separate the footers and references.
Yes, the current code is not suitable for multi-column documents. However, your implementation seems to place strict requirements on the input layout: the left and right columns must be the same width, and the screenshot must be centered. That may not be a good fit for Pix2Text, which currently targets arbitrary screenshots rather than scanned document images. A better approach would be to run layout analysis first to split the page into blocks (or even chunks), and then recognize them block by block. By the way, PRs are welcome.
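PaddleOCR provides a layout recovery module that runs after text recognition, which might solve the reading-order problem after OCR on double-column documents. For details, see: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/docs/PP-StructureV2_introduction.md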
pix2text >= 1.1 has integrated layout analysis models for the task of recognizing complex layouts.