paper-qa pdf parsing doesn't handle multi-column papers correctly

Open markenki opened this issue 7 months ago • 2 comments

In readers.py, the text extracted from multi-column PDF documents doesn't respect column boundaries: lines are read straight across the page, so text from one column runs into the next. To fix this, the following line:

text = page.get_text("text", sort=True)

should be replaced by these lines:

# Extract text blocks from the page
blocks = page.get_text("blocks")
# Concatenate text blocks, which are already in the correct order, into a single string
text = "\n".join(block[4] for block in blocks)
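
As a rough illustration of why the two approaches can differ, here is a minimal sketch that does not import PyMuPDF. It assumes block tuples shaped like the ones `page.get_text("blocks")` returns, `(x0, y0, x1, y1, text, block_no, block_type)`, and uses made-up coordinates for a two-column page. The helper `join_text_blocks` and the `sample_blocks` data are hypothetical and not part of paper-qa. The position-based sort only approximates what `sort=True` does, and the skip of non-text blocks is an extra precaution that the proposed fix above does not include.

```python
# Each tuple mirrors the shape returned by PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type); block_type 0 is text, 1 is an image.
def join_text_blocks(blocks):
    """Join text blocks in the order given, skipping non-text (image) blocks."""
    return "\n".join(b[4].strip() for b in blocks if b[6] == 0)


# Hypothetical two-column page: blocks 0-1 are the left column, 2-3 the right.
sample_blocks = [
    (50, 100, 290, 200, "left column, paragraph 1\n", 0, 0),
    (50, 210, 290, 300, "left column, paragraph 2\n", 1, 0),
    (310, 100, 550, 200, "right column, paragraph 1\n", 2, 0),
    (310, 210, 550, 300, "right column, paragraph 2\n", 3, 0),
]

# Keeping the stored block order reads each column top to bottom.
in_block_order = join_text_blocks(sample_blocks)

# Sorting by vertical, then horizontal position (an approximation of sort=True)
# alternates between the two columns.
by_position = join_text_blocks(sorted(sample_blocks, key=lambda b: (b[1], b[0])))

print(in_block_order)
print("---")
print(by_position)
```

Whether the stored block order matches reading order depends on how the PDF was produced, so it is worth checking the output on a few representative papers.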

I'd submit a pull request, but it seems I don't have sufficient permissions to do so.

Thanks!

Apr 16 '25 22:04 markenki
