PyMuPDF
Memory leaks when merging PDFs
Description of the bug
Hello,
First of all, thank you for all the work you've been putting into this project. Last November, I reported a minor memory leak issue related to the save() function, which was promptly addressed and fixed. Thank you for that. However, I've encountered memory leaks under different scenarios since that fix.
Issue Overview:
I've observed memory leaks while merging PDFs. To investigate, I built a test suite of 150 random PDFs from our dataset. Of these, 44 caused memory leaks when merged with another PDF. The attached archive includes 42 of them (I had to remove 2 to stay under GitHub's file size cap). Unfortunately, I couldn't pinpoint the specific issue within each PDF.
Contents of the provided archive
- A tests directory, containing a subdirectory for each of the 44 test cases where leaks were observed.
- reproduce.py: A script (referenced in https://github.com/pymupdf/PyMuPDF/issues/2791) that executes 500 merges for each test case to simulate the issue.
- test_leaks.py: A script that automates the running of reproduce.py across all identified leaking cases.
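For readers without the archive, the kind of loop described above can be sketched roughly as follows. This is not the actual reproduce.py: the function names, the sampling interval, and the use of peak RSS from the standard resource module (Unix only) are assumptions made for illustration. The merge itself uses PyMuPDF's Document.insert_pdf and is imported lazily, so the measuring helper also works without PyMuPDF installed.

```python
import resource


def peak_rss():
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def run_repeatedly(operation, iterations=500, sample_every=50):
    """Call operation() repeatedly and record (iteration, peak RSS) samples.

    A peak that keeps climbing long after the first few iterations
    suggests memory that is never released.
    """
    samples = []
    for i in range(1, iterations + 1):
        operation()
        if i % sample_every == 0:
            samples.append((i, peak_rss()))
    return samples


def merge_once(cover_path, content_path):
    """Hypothetical single merge: append content.pdf to coverpage.pdf in memory."""
    import fitz  # PyMuPDF, imported lazily so the helpers above stay stdlib-only

    with fitz.open(cover_path) as merged, fitz.open(content_path) as content:
        merged.insert_pdf(content)
        merged.tobytes()  # serialize the result without writing a file
```

A call such as run_repeatedly(lambda: merge_once("coverpage.pdf", "content.pdf")) would then give a coarse picture of whether memory keeps growing across the 500 merges.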
How to reproduce the bug
- Unpack the archive into a directory of your choice.
- Within a Python 3.11 virtual environment, install the required dependencies using pip install -r requirements.txt.
- Execute the command python test_leaks.py run.
Expected files after execution
Each test case directory will include:
- content.pdf: The PDF file used in the test.
- coverpage.pdf: A common PDF file merged with content.pdf in each test, identical across all tests.
- plot.dat: Memory usage data, which can be visualized with mprof.
To review the memory usage graphs, please use the following command from within the tests directory: for test in $(ls); do mprof plot $test/plot.dat; done
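To check the growth numerically instead of reading it off the graphs, a small parser along these lines may help. It is a sketch that assumes plot.dat uses the usual mprof layout, where each sample is a line of the form "MEM <MiB> <timestamp>" and other lines (such as CMDLINE) can be ignored. The function names are made up for this example.

```python
def read_mem_samples(lines):
    """Return (timestamp, MiB) pairs from the MEM lines of an mprof data file."""
    samples = []
    for line in lines:
        fields = line.split()
        # Assumed layout: "MEM <memory in MiB> <unix timestamp>"
        if len(fields) == 3 and fields[0] == "MEM":
            samples.append((float(fields[2]), float(fields[1])))
    return samples


def growth_mib(samples):
    """Difference in MiB between the last and the first sample."""
    if len(samples) < 2:
        return 0.0
    return samples[-1][1] - samples[0][1]
```

For a single test case this could be used as growth_mib(read_mem_samples(open("plot.dat"))), with a clearly positive value over 500 merges pointing at a leak.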
I hope this information aids in troubleshooting the issue. I'm available to provide any further assistance that might be helpful in resolving this.
PyMuPDF version
1.23.25
Operating system
Linux
Python version
3.11
I think I might have found a bug in MuPDF's C++ bindings that could be causing these leaks.
The fix has been pushed to MuPDF branch master.
We're hoping that a new MuPDF release branch will be made soon from MuPDF master, which should fix the issue for the subsequent PyMuPDF release.
[The current PyMuPDF branch main in git works with MuPDF branch master, so if you want to try the fix before then, you could use the instructions at https://pymupdf.readthedocs.io/en/latest/installation.html#build-and-install-from-local-pymupdf-checkout-and-optional-local-mupdf-checkout to build PyMuPDF yourself.]
Fixed in 1.24.0.