PyMuPDF
PyMuPDF Pro 1.25.0: A 1-page .docx file is split into 4 pages
Description of the bug
For some documents, PyMuPDF Pro splits the content across many more pages than Google Docs, Mac Pages, or LibreOffice show for the same file.
This causes downstream problems. For example, exporting the first page as a PNG via page.get_pixmap().tobytes(output="png") does not produce the expected first page.
How to reproduce the bug
Download the attached 1page-is-split-into-4pages.docx and run:
import pymupdf.pro
pymupdf.pro.unlock()  # use a trial key to see output of 4th page etc.
document = pymupdf.open("1page-is-split-into-4pages.docx")
for page in document:
    print(page)
    print(page.get_text())
Observe that PyMuPDF reports 4 pages, whereas Google Docs, Mac Pages, and LibreOffice all show the same file as a single page.
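To make the discrepancy easier to compare across versions, a per-page summary may help more than the raw text dump. The following is only a sketch: summarize_pages and main are hypothetical helper names, not part of PyMuPDF, and the code assumes the same unlock() and open() calls as in the reproduction steps above. It relies on the standard Page attributes number and rect and on Page.get_text().

```python
def summarize_pages(document):
    """Return one (page_number, width, height, text_length) tuple per page.

    `document` is anything that yields page objects with `number`, `rect`
    and `get_text()`, such as a document opened with pymupdf.
    """
    summary = []
    for page in document:
        rect = page.rect  # page dimensions in points
        summary.append(
            (page.number, round(rect.width), round(rect.height), len(page.get_text()))
        )
    return summary


def main(path="1page-is-split-into-4pages.docx"):
    # Imported here so the helper above can be used without PyMuPDF Pro installed.
    import pymupdf.pro

    pymupdf.pro.unlock()  # pass a trial key here, as in the reproduction steps
    document = pymupdf.open(path)
    print(f"page_count = {document.page_count}")
    for row in summarize_pages(document):
        print(row)
```

Calling main() prints the page count followed by one tuple per page. Pages with an unexpectedly small text length, or a page size that differs from what the other applications use, would hint at where the extra page breaks come from.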
PyMuPDF version
1.25.0
Operating system
Linux
Python version
3.9