
Support for full PDF "image text" OCR in pymupdf

Open kvnxiao opened this issue 10 months ago • 0 comments

Can we add some sort of toggle or option to enable full-page OCR via Tesseract when pymupdf is installed? I hacked around the vendored library in my local virtualenv and changed readers.py to something like the following, which works, but an upstream solution would be better:

def parse_pdf_fitz(# ...
    # ...
    for i in range(file.page_count):
        page = file.load_page(i)
        tp = page.get_textpage_ocr(dpi=300, full=True)
        page_text = page.get_text(textpage=tp, sort=True)
        # print(page_text)
        split += page_text
        pages.append(str(i + 1))
    # ...

kvnxiao, Aug 17 '23 22:08
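One way the requested toggle could look is sketched below. This is only an illustration, not paper-qa's actual API: the names `extract_page_text`, `read_pdf_pages`, and the `use_ocr` flag are hypothetical, and the only library calls used are the PyMuPDF ones already shown in the snippet above (`get_textpage_ocr` and `get_text`). It assumes Tesseract and its language data are installed where PyMuPDF can find them.

```python
from typing import Any, List, Tuple


def extract_page_text(page: Any, use_ocr: bool = False, dpi: int = 300) -> str:
    """Return the text of one PyMuPDF page, optionally via full-page OCR.

    `use_ocr` is a hypothetical toggle; it is off by default so the
    existing (fast) text-layer extraction stays the default behavior.
    """
    if use_ocr:
        # full=True asks PyMuPDF to OCR the whole rendered page,
        # not just embedded images.
        tp = page.get_textpage_ocr(dpi=dpi, full=True)
        return page.get_text(textpage=tp, sort=True)
    # Default path: read the embedded text layer only.
    return page.get_text(sort=True)


def read_pdf_pages(path: str, use_ocr: bool = False) -> Tuple[List[str], List[str]]:
    """Return (page_texts, page_labels) for the PDF at `path`."""
    import fitz  # PyMuPDF, imported lazily so it stays an optional dependency

    texts: List[str] = []
    pages: List[str] = []
    with fitz.open(path) as doc:
        for i in range(doc.page_count):
            texts.append(extract_page_text(doc.load_page(i), use_ocr=use_ocr))
            pages.append(str(i + 1))  # 1-based page labels, as in the snippet above
    return texts, pages
```

Keeping the per-page logic in its own function means the OCR path can be switched on per call, and full-page OCR (which is much slower than reading the text layer) is only paid for when explicitly requested.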