Huge memory/cpu utilization for 1 page PDF extraction

Open alisufian opened this issue 10 years ago • 8 comments

extractText() uses a massive amount of CPU and memory on the following 1-page, 3 MB file. The extraction doesn't complete and the process has to be killed.

http://www.dora.state.co.us/pls/efi/efi_p2_v2_demo.show_document?p_dms_document_id=105933&p_session_id=

alisufian avatar Feb 20 '14 07:02 alisufian

More evidence that extractText() needs work. It is a very complex file, however, and opening it in a PDF viewer was somewhat difficult for my system.

mstamy2 avatar Feb 20 '14 23:02 mstamy2

Hi Matt, Thank you for the prompt response. I agree the page is kind of heavy. Takes some time to render on my machine as well.

I greatly appreciate your time and effort on creating and maintaining this library.

alisufian avatar Feb 21 '14 06:02 alisufian

@alisufian I'm currently looking into performance topics. The link seems not to load on my machine. Do you still have that document somewhere?

MartinThoma avatar Jun 14 '22 19:06 MartinThoma

@MartinThoma Here is the document (the download worked for me): PUC_Quiet crossing_school boundaries_11X17.pdf

I've opened the PDF in Acrobat Reader and the file looks very heavy (lots of drawings/images?).

pubpub-zz avatar Jun 14 '22 20:06 pubpub-zz

[Image: call graph (mine.svg)]

This callgraph was created via:

$ python -m cProfile -o profile.pstats script.py
$ gprof2dot -f pstats profile.pstats | dot -Tsvg -o mine.svg

with

from PyPDF2 import PdfReader

reader = PdfReader('PUC_Quiet.crossing_school.boundaries_11X17.pdf')
text = ""
for page in reader.pages:
    text += page.extract_text()
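
If Graphviz is not at hand, the same kind of profile can also be read with the standard-library pstats module. The following is only a minimal sketch: `workload` and `profile_top` are hypothetical names, and `workload` is a stand-in for the extraction loop above, not the real PDF run.

```python
import cProfile
import io
import pstats


def workload(n=20_000):
    # Hypothetical stand-in for the page.extract_text() loop above.
    return sum(float(str(i)) for i in range(n))


def profile_top(func, limit=10):
    """Profile func() and return the top entries by cumulative time as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


report = profile_top(workload)
print(report)
```

Sorting by cumulative time puts the functions that dominate the run at the top, which is roughly the information the call graph shows visually.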

MartinThoma avatar Jun 19 '22 19:06 MartinThoma

Some micro-benchmarks on Python 3.6:

peek in (b"\r", b"\n") vs peek in b"\r\n" vs peek == b"\r" or peek == b"\n"

In [8]: %timeit peek not in b"\r\n"
258 ns ± 3.11 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [11]: %timeit a != b"\r" and a != b"\n"
65.9 ns ± 0.618 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [9]: %timeit peek not in (b"\r", b"\n")
56 ns ± 0.351 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

That is surprising to me.
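
For anyone who wants to repeat the comparison outside IPython, here is a rough sketch using the standard timeit module. It assumes `peek` is a single byte (as `stream.read(1)` would return); the labels and the `run_benchmarks` helper are made up for illustration, and the absolute numbers will differ by machine and Python version.

```python
import timeit

# Assumption: peek is a single byte, as returned by stream.read(1).
peek = b"a"

STATEMENTS = {
    "bytes membership": r'peek not in b"\r\n"',
    "two comparisons": r'peek != b"\r" and peek != b"\n"',
    "tuple membership": r'peek not in (b"\r", b"\n")',
}


def run_benchmarks(number=200_000, repeat=5):
    """Return the best per-loop time in nanoseconds for each statement."""
    results = {}
    for label, stmt in STATEMENTS.items():
        timings = timeit.repeat(
            stmt, globals={"peek": peek}, number=number, repeat=repeat
        )
        results[label] = min(timings) / number * 1e9
    return results


results = run_benchmarks()
for label, ns in results.items():
    print(f"{label:>18}: {ns:7.1f} ns per loop")
```

Note that the three expressions are only interchangeable when `peek` is exactly one byte: `b"" in b"\r\n"` is True, and so is `b"\r\n" in b"\r\n"`, while the tuple and comparison forms treat both as "not a line ending".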

Decimal instantiation

In [13]: %timeit Decimal(0)
168 ns ± 0.3 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

In [14]: %timeit Decimal("0")
223 ns ± 1.89 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

In [15]: %timeit Decimal(0.0)
430 ns ± 4.32 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
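
The same measurement as a standalone script, as a sketch only. The `time_constructors` helper is a hypothetical name; it also checks that the three spellings produce equal values, so the choice between them is purely about speed.

```python
import timeit
from decimal import Decimal

CONSTRUCTORS = ["Decimal(0)", 'Decimal("0")', "Decimal(0.0)"]


def time_constructors(number=200_000, repeat=5):
    """Return the best per-call time in nanoseconds for each constructor form."""
    results = {}
    for stmt in CONSTRUCTORS:
        timings = timeit.repeat(
            stmt, globals={"Decimal": Decimal}, number=number, repeat=repeat
        )
        results[stmt] = min(timings) / number * 1e9
    return results


# All three forms yield the same value.
same_value = Decimal(0) == Decimal("0") == Decimal(0.0)

decimal_results = time_constructors()
for stmt, ns in decimal_results.items():
    print(f"{stmt:>14}: {ns:7.1f} ns per call")
```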

MartinThoma avatar Jun 19 '22 19:06 MartinThoma

After applying https://github.com/py-pdf/PyPDF2/pull/1014 :

[Image: call graph after the change (mine.svg)]

MartinThoma avatar Jun 19 '22 20:06 MartinThoma

One of the performance killers is creating the FloatObjects. We do it 6 million times in this example and it accounts for 8% of the workload. But replacing FloatObject with normal floats (or decimals) would be quite a massive change.
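
To get a feeling for the order of magnitude, here is a rough sketch that compares a Decimal-based wrapper with a plain float. `SketchFloat` is a hypothetical stand-in written for this illustration, not PyPDF2's actual FloatObject, and the assumption that the wrapper parses its value from a string is mine. The 6 million figure is the call count mentioned above.

```python
import timeit
from decimal import Decimal

CALLS = 6_000_000  # number of instantiations quoted in this thread


class SketchFloat(Decimal):
    """Hypothetical Decimal-based number wrapper (not PyPDF2's class)."""

    def __new__(cls, value="0"):
        # Assumption: the value arrives as a token and is parsed via str().
        return Decimal.__new__(cls, str(value))


def per_call_ns(stmt, namespace, number=200_000, repeat=5):
    """Best observed time per call, in nanoseconds."""
    best = min(timeit.repeat(stmt, globals=namespace, number=number, repeat=repeat))
    return best / number * 1e9


namespace = {"SketchFloat": SketchFloat}
wrapper_ns = per_call_ns('SketchFloat("12.5")', namespace)
float_ns = per_call_ns('float("12.5")', namespace)

for label, ns in (("SketchFloat", wrapper_ns), ("float", float_ns)):
    projected_s = ns * CALLS / 1e9
    print(f"{label:>12}: {ns:7.1f} ns per call, ~{projected_s:.2f} s for {CALLS:,} calls")
```

Even a few hundred nanoseconds per object adds up to seconds at this call count, which is why object creation shows up in the profile at all.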

MartinThoma avatar Jun 19 '22 20:06 MartinThoma

@MartinThoma with the new PR can you rerun to check for improvements?

pubpub-zz avatar Feb 05 '23 08:02 pubpub-zz

I'm closing this issue now as I don't have any further approach to speed this up (except for writing a C/C++/Rust module, which would be a very different beast).

MartinThoma avatar Feb 28 '23 07:02 MartinThoma