layout-parser
Layout Parser text boxes not properly aligned causing incorrect sorting of text boxes
Hi,
I'm using Layout Parser to perform OCR on a research paper, but on almost every page of the PDF the detected text boxes are not properly aligned. For example, I input this page:
I perform detection using:
model = lp.Detectron2LayoutModel('lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x/config',
                                 extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
                                 label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"})
layout = model.detect(image)
# Show the detected layout of the input image
lp.draw_box(image, layout, box_width=3)
The detected image is shown below:
As can be seen, the bottom-left box is not properly aligned, which causes a problem with the sorting script given in the tutorial:
# sort the left and right blocks and assign id to each
w, h = image.size  # assuming a PIL image: Image.size is (width, height)
left_interval = lp.Interval(0, w/2*1.05, axis='x').put_on_canvas(image)
left_blocks = text_blocks.filter_by(left_interval, center=True)
left_blocks.sort(key = lambda b:b.coordinates[1])
right_blocks = [b for b in text_blocks if b not in left_blocks]
right_blocks.sort(key = lambda b:b.coordinates[1])
# And finally combine the two list and add the index
# according to the order
text_blocks = lp.Layout([b.set(id = idx) for idx, b in enumerate(left_blocks + right_blocks)])
# visualize the cleaned text blocks
lp.draw_box(image, text_blocks,
            box_width=3,
            show_element_id=True)
The misaligned box is given an index of 0, which is not correct.
Is there any way to avoid this problem?
Thank you
Thanks - this is more of an issue with the detection model (it's very hard for these models to generate perfect bounding box detections). I have a script that can fix this issue, but I can't share it with you right now due to some copyright issues -- it should be ready within the next few weeks, so please stay tuned.
Hi there,
First, remove the 1.05 factor from the line below, i.e. don't multiply at all:
left_interval = lp.Interval(0, w/2*1.05, axis='x').put_on_canvas(image)
If that does not work for you, create your own function that splits the blocks into two lists and sorts each list by y1, assuming you only have a two-column layout throughout your document. Two lists, one holding the left blocks and one holding the right blocks, should do the job.
text_blocks = lp.Layout([b.set(id = idx) for idx, b in enumerate(left_blocks + right_blocks)])
In the line above, swap in your own left and right lists for the tutorial's left_blocks and right_blocks, and ka-boom, it works.
Happy coding :)
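A minimal sketch of the two-list idea above, kept independent of layoutparser so it runs on its own. The `Box` tuple and the `two_column_order` helper are hypothetical names made up for illustration; the tuple follows the (x1, y1, x2, y2) ordering, and the split point is the plain page midpoint with no 1.05 factor. It assumes a two-column page throughout.

```python
from typing import List, Tuple

# Hypothetical stand-in for a text block: (x1, y1, x2, y2)
Box = Tuple[float, float, float, float]


def two_column_order(boxes: List[Box], page_width: float) -> List[Box]:
    """Return boxes in reading order: left column top to bottom, then right."""
    mid = page_width / 2  # plain midpoint, no 1.05 multiplier

    # Assign each box to a column by the x position of its centre
    left = [b for b in boxes if (b[0] + b[2]) / 2 < mid]
    right = [b for b in boxes if (b[0] + b[2]) / 2 >= mid]

    # Sort each column by y1 (top edge)
    left.sort(key=lambda b: b[1])
    right.sort(key=lambda b: b[1])
    return left + right


boxes = [
    (320, 50, 580, 200),   # right column, top
    (20, 400, 290, 700),   # left column, bottom
    (20, 50, 290, 380),    # left column, top
    (320, 220, 580, 700),  # right column, bottom
]
ordered = two_column_order(boxes, page_width=600)
```

With real layoutparser blocks, the same comparison could be made on `b.coordinates`, and the ids assigned with the `enumerate` line quoted above.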
I have another approach to separating the layouts: if we want to separate the left and right columns, we can simply use the k-means clustering algorithm with the number of clusters set to 2.
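A rough sketch of the clustering idea, assuming the clustering is done on each block's left edge (x1); the post does not say which feature to cluster on. To stay dependency-free this uses a small hand-written 1-D k-means with k=2 rather than a library implementation, and `two_means_1d` is a hypothetical helper name. It expects a non-empty list of values.

```python
from typing import List, Sequence


def two_means_1d(values: Sequence[float], iterations: int = 20) -> List[int]:
    """Label each value 0 (left cluster) or 1 (right cluster) via 1-D k-means, k=2."""
    lo, hi = min(values), max(values)  # start the two centroids at the extremes
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid wins
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        left = [v for v, lab in zip(values, labels) if lab == 0]
        right = [v for v, lab in zip(values, labels) if lab == 1]
        if not left or not right:
            break  # only one cluster found (e.g. a single-column page)
        # Update step: move each centroid to the mean of its members
        new_lo, new_hi = sum(left) / len(left), sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return labels


# Boxes as (x1, y1, x2, y2); the third one is slightly shifted to the right
boxes = [
    (20, 50, 290, 380),
    (320, 50, 580, 200),
    (35, 400, 300, 700),
    (320, 220, 580, 700),
]
labels = two_means_1d([b[0] for b in boxes])

left_col = sorted((b for b, lab in zip(boxes, labels) if lab == 0), key=lambda b: b[1])
right_col = sorted((b for b, lab in zip(boxes, labels) if lab == 1), key=lambda b: b[1])
ordered = left_col + right_col
```

Because the split follows the data rather than a fixed page midpoint, a slightly shifted box still lands in the column it is closest to.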
If there are two columns, first find the median of the first coordinate (x1) of the right-column blocks and compare it with the x1 of every block in the left column. If a left-column block's x1 is greater than the median, remove that block from the left column and append it to the right column. Sort both columns again, and you are done.
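The median check could be sketched as follows, assuming blocks are plain (x1, y1, x2, y2) tuples and that an initial left/right split already exists. The `reassign_by_median` helper and its `tol` parameter are hypothetical: `tol` is a small slack added here so that a block starting a few pixels left of the median still moves, which the description above does not mention.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical stand-in for a text block: (x1, y1, x2, y2)
Box = Tuple[float, float, float, float]


def reassign_by_median(
    left: List[Box], right: List[Box], tol: float = 10.0
) -> Tuple[List[Box], List[Box]]:
    """Move left-column boxes whose x1 reaches the right column's median x1."""
    if not right:
        return sorted(left, key=lambda b: b[1]), []

    right_x1 = median(b[0] for b in right)

    # tol is an assumed slack, not part of the original description
    stay = [b for b in left if b[0] < right_x1 - tol]
    moved = [b for b in left if b[0] >= right_x1 - tol]

    # Re-sort both columns by y1 (top edge) after the move
    new_left = sorted(stay, key=lambda b: b[1])
    new_right = sorted(right + moved, key=lambda b: b[1])
    return new_left, new_right


left = [(20, 50, 290, 380), (315, 400, 580, 700)]  # second box is misassigned
right = [(320, 50, 580, 200), (322, 220, 580, 390)]
new_left, new_right = reassign_by_median(left, right)
```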