pypdf
Huge memory/cpu utilization for 1 page PDF extraction
extractText() cpu/memory utilization is massive for the following 1 page 3 MB file. The extraction doesn't complete and the process has to be killed.
http://www.dora.state.co.us/pls/efi/efi_p2_v2_demo.show_document?p_dms_document_id=105933&p_session_id=
More evidence that extractText() needs work. It is a very complex file, however, and opening it in a PDF viewer was somewhat difficult for my system.
Hi Matt, Thank you for the prompt response. I agree the page is kind of heavy. Takes some time to render on my machine as well.
I greatly appreciate your time and effort on creating and maintaining this library.
@alisufian I'm currently looking into performance topics. The link seems not to load on my machine. Do you still have that document somewhere?
@MartinThoma Here is the document (the download worked for me): PUC_Quiet crossing_school boundaries_11X17.pdf
I've opened the PDF in Acrobat Reader and the file looks very heavy (lots of drawings/images?).
This callgraph was created via:
$ python -m cProfile -o profile.pstats script.py
$ gprof2dot -f pstats profile.pstats | dot -Tsvg -o mine.svg
with
from PyPDF2 import PdfReader
reader = PdfReader('PUC_Quiet.crossing_school.boundaries_11X17.pdf')
text = ""
for page in reader.pages:
    text += page.extract_text()
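If gprof2dot or Graphviz is not at hand, the same kind of profile can be read as plain text with the standard library alone. The following is only a minimal sketch: `profile_call` and `busy_work` are hypothetical names introduced here, and `busy_work` is a stand-in workload that would be replaced by the extraction loop above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func under cProfile and return (result, report_text).

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    # Render the hottest entries, sorted by cumulative time, into a string.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def busy_work(n):
    # Stand-in workload (assumption); swap in the page.extract_text() loop.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    value, report = profile_call(busy_work, 100_000)
    print(report)
```

Sorting by cumulative time tends to surface the same hot paths that the call graph highlights, just without the edges between callers and callees.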
Some micro-benchmarks on Python 3.6:
peek in (b"\r", b"\n")
vs peek in b"\r\n"
vs peek == b"\r" or peek == b"\n"
In [8]: %timeit peek not in b"\r\n"
258 ns ± 3.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [11]: %timeit a != b"\r" and a != b"\n"
65.9 ns ± 0.618 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [9]: %timeit peek not in (b"\r", b"\n")
56 ns ± 0.351 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
That is surprising to me.
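For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard `timeit` module. The names `VARIANTS` and `bench` are made up for this example, and absolute numbers will differ by machine and Python version. One caveat worth keeping in mind: the three spellings are not interchangeable in every case, because a `bytes in bytes` test is a substring check, so an empty read (for example at end of stream) counts as "contained".

```python
import timeit

# The three spellings compared above, written as source strings for timeit.
VARIANTS = {
    "tuple membership": r'peek not in (b"\r", b"\n")',
    "bytes membership": r'peek not in b"\r\n"',
    "explicit compare": r'peek != b"\r" and peek != b"\n"',
}


def bench(peek=b"x", number=200_000, repeat=5):
    """Return the best observed nanoseconds per evaluation for each variant."""
    results = {}
    for name, stmt in VARIANTS.items():
        timings = timeit.repeat(
            stmt, globals={"peek": peek}, number=number, repeat=repeat
        )
        results[name] = min(timings) / number * 1e9
    return results


# Semantic difference: substring check vs. element check.
empty_in_bytes = b"" in b"\r\n"            # True: empty bytes is a substring
empty_in_tuple = b"" in (b"\r", b"\n")     # False: no element equals b""

if __name__ == "__main__":
    for name, ns in bench().items():
        print(f"{name:>17}: {ns:7.1f} ns")
    print("empty read matches bytes form:", empty_in_bytes)
    print("empty read matches tuple form:", empty_in_tuple)
```

Because of that difference, switching between the forms in a parser loop is only safe if the empty-read case is handled separately.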
Decimal instantiation
In [13]: %timeit Decimal(0)
168 ns ± 0.3 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
In [14]: %timeit Decimal("0")
223 ns ± 1.89 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
In [15]: %timeit Decimal(0.0)
430 ns ± 4.32 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
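Before swapping one constructor form for another on speed grounds, it may help to check that they produce the same value. A small sketch (the helper name `time_constructors` is hypothetical): the three forms agree for zero, but building a `Decimal` from a float keeps the float's binary rounding error, so the forms diverge for values such as 0.1.

```python
import timeit
from decimal import Decimal


def time_constructors(number=200_000):
    """Best-of-three nanoseconds per call for the three constructor forms."""
    forms = ("Decimal(0)", 'Decimal("0")', "Decimal(0.0)")
    return {
        form: min(
            timeit.repeat(form, globals={"Decimal": Decimal}, number=number, repeat=3)
        ) / number * 1e9
        for form in forms
    }


# For zero, all three forms compare equal.
zero_forms_agree = Decimal(0) == Decimal("0") == Decimal(0.0)

# For 0.1, the float-based form carries the binary approximation of 0.1.
tenth_forms_agree = Decimal(0.1) == Decimal("0.1")

if __name__ == "__main__":
    for form, ns in time_constructors().items():
        print(f"{form:>14}: {ns:7.1f} ns")
    print("zero forms agree: ", zero_forms_agree)
    print("tenth forms agree:", tenth_forms_agree)
```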
After applying https://github.com/py-pdf/PyPDF2/pull/1014 :
One of the performance killers is creating the FloatObjects. We do it 6 million times in this example and it accounts for 8% of the workload. But replacing the FloatObject with normal floats (or decimals) would be quite a massive change.
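To get a feeling for why millions of number objects add up, one can compare per-object construction cost for a plain `float`, a `Decimal`, and a Decimal subclass with a Python-level `__new__`. This is a sketch under assumptions: `DecimalNumber` is a hypothetical stand-in written for this comment, not PyPDF2's actual `FloatObject`, and `cost_per_object` is likewise an illustrative helper.

```python
import timeit
from decimal import Decimal


class DecimalNumber(Decimal):
    """Hypothetical stand-in for a PDF number object (not PyPDF2's FloatObject)."""

    def __new__(cls, value="0"):
        # A Python-level __new__ adds call overhead on top of Decimal parsing.
        return Decimal.__new__(cls, str(value))


def cost_per_object(factory, token="12.5", number=100_000):
    """Best-of-three nanoseconds to build one object from a text token."""
    timer = timeit.Timer(lambda: factory(token))
    return min(timer.repeat(repeat=3, number=number)) / number * 1e9


if __name__ == "__main__":
    for name, factory in (
        ("float", float),
        ("Decimal", Decimal),
        ("DecimalNumber", DecimalNumber),
    ):
        print(f"{name:>13}: {cost_per_object(factory):7.1f} ns per object")
```

Multiplying the per-object figure by the roughly 6 million instantiations mentioned above gives a ballpark for how much wall time this step alone could take on a given machine.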
@MartinThoma with the new PR can you rerun to check for improvements?
I'm closing this issue now as I don't have any further approach to speed this up (except for writing a C/C++/Rust module, which would be a very different beast)