
Fix pdf parsing bug

Open markenki opened this issue 7 months ago • 1 comment

The line text = page.get_text("text", sort=True) in readers.py does not respect multi-column layouts. For example, when applied to pasa.pdf (in tests/stub_data), the first extracted line is "We introduce PaSa, an advanced Paper Search Academic paper search lies at the core of research", but the first half of that comes from the first column and the second half comes from the second column.

Replacing that line of code with

# Extract text blocks, which are already in the correct order, from the page
blocks = page.get_text("blocks", sort=False)

# Concatenate text blocks into a single string
text = "\n".join(block[4] for block in blocks)

extracts this text: "We introduce PaSa, an advanced Paper Search\nagent powered by large language models.", which is correct.
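To make the difference concrete without needing the PDF, here is a small sketch that uses synthetic stand-ins for the tuples returned by page.get_text("blocks"), which have the shape (x0, y0, x1, y1, text, block_no, block_type). The coordinates are invented, each line is modelled as its own block for simplicity, and the helper name join_text_blocks is hypothetical. Sorting by (y0, x0) is only a rough approximation of what positional sorting does, not PyMuPDF's actual algorithm. The block_type filter is a defensive assumption in case image blocks are present.

```python
# Synthetic stand-ins for page.get_text("blocks") tuples:
# (x0, y0, x1, y1, text, block_no, block_type); block_type 0 means text.
# Coordinates are made up: two lines in the left column, one in the right.
BLOCKS = [
    (50, 100, 290, 112, "We introduce PaSa, an advanced Paper Search", 0, 0),
    (50, 114, 290, 126, "agent powered by large language models.", 1, 0),
    (310, 100, 550, 112, "Academic paper search lies at the core of research", 2, 0),
]


def join_text_blocks(blocks, sort_by_position=False):
    """Join the text of text-type blocks into one newline-separated string.

    With sort_by_position=True the blocks are first ordered top-to-bottom,
    then left-to-right, which interleaves side-by-side columns.
    """
    # Keep only text blocks (hypothetical safeguard against image blocks).
    text_blocks = [b for b in blocks if b[6] == 0]
    if sort_by_position:
        text_blocks = sorted(text_blocks, key=lambda b: (b[1], b[0]))
    return "\n".join(b[4].strip() for b in text_blocks)


in_order = join_text_blocks(BLOCKS)
by_position = join_text_blocks(BLOCKS, sort_by_position=True)

print(in_order)
print("---")
print(by_position)
```

In this toy layout, the unsorted join keeps the left column together, while the positional sort places the right-column line between the two left-column lines. Whether block order matches reading order in a given PDF depends on how the file was produced, so this would be worth checking against more documents than pasa.pdf.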

markenki, Apr 16 '25 23:04