paper-qa
Fix PDF parsing bug
The line
text = page.get_text("text", sort=True)
in readers.py does not respect multi-column layouts. For example, when applied to pasa.pdf (in tests/stub_data), the first line of text is extracted as "We introduce PaSa, an advanced Paper Search Academic paper search lies at the core of research". The first half of that line comes from the first column and the second half comes from the second column.
Replacing that line of code with
# Extract text blocks, which are already in the correct order, from the page
blocks = page.get_text("blocks", sort=False)
# Concatenate text blocks into a single string
text = "\n".join(block[4] for block in blocks)
extracts this text: "We introduce PaSa, an advanced Paper Search\nagent powered by large language models.", which is correct.
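To illustrate why the ordering matters, here is a minimal sketch on synthetic data, with no PDF involved. It assumes the block tuple layout that PyMuPDF documents for get_text("blocks"), namely (x0, y0, x1, y1, text, block_no, block_type). The helper names blocks_to_text and position_sorted are hypothetical and are not part of readers.py, and the position sort is only a rough stand-in for what sort=True does, not its exact implementation.

```python
from typing import List, Tuple

# Assumed block layout from PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type); block_type 0 is text, 1 is image.
Block = Tuple[float, float, float, float, str, int, int]


def blocks_to_text(blocks: List[Block]) -> str:
    """Join the text of all text blocks, keeping the order they are given in."""
    return "\n".join(b[4] for b in blocks if b[6] == 0)


def position_sorted(blocks: List[Block]) -> List[Block]:
    """Approximate a position-based sort: top-to-bottom, then left-to-right."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))


# Synthetic two-column page: the left column is listed first, then the right.
blocks: List[Block] = [
    (50.0, 100.0, 290.0, 120.0, "left 1", 0, 0),
    (50.0, 130.0, 290.0, 150.0, "left 2", 1, 0),
    (310.0, 100.0, 550.0, 120.0, "right 1", 2, 0),
    (310.0, 130.0, 550.0, 150.0, "right 2", 3, 0),
]

# Keeping the given order reads each column top to bottom.
in_order = blocks_to_text(blocks)

# Sorting by position interleaves the two columns row by row.
interleaved = blocks_to_text(position_sorted(blocks))
```

In this sketch, in_order keeps each column contiguous, while interleaved alternates between the left and right columns, which is the same symptom described above for pasa.pdf.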