benchmarks
pdfrw vs pypdf page extraction & merge
Test run on Python 3.8, Windows 7:
- I took 4 arbitrary page numbers (4, 6, 8, 9).
- For each of the PDF files listed in the benchmark, I extracted those pages (when available).
- Then I created a new PDF from the extracted pages, repeating them between 1 and 5 times (to check how well pdfrw / pypdf optimize the size of created PDFs containing repetitive information). So output PDFs have up to 4x5 = 20 pages.
- I measured elapsed time and output file sizes.
I recall my initial code also deleted the original bookmarks/annotations from the PDFs, but I removed that part for simplicity and left a comment pointing to where I had read about it.
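For reference, a minimal sketch of what that removal looked like (my reconstruction, assuming pdfrw's PdfDict attribute access, where assigning None drops the key; it is not part of the benchmark below):

from pdfrw import PdfReader

pages = PdfReader("input.pdf").pages
for page in pages:
    # /Annots holds the page's annotation array; assigning None removes the key
    page.Annots = None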
Code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

def fsize(filepath):
    """Return [size in bytes, size in KB as a formatted string]."""
    import os
    finfo = os.stat(filepath)
    nbytes = finfo.st_size
    KB = "%.2f" % (nbytes / 1024)
    return [nbytes, KB]

#@profile
def createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=None, pageslist=None, destpdf=None, debug=False):
    """ <https://github.com/pmaupin/pdfrw/blob/master/examples/subset.py>
    """
    from pdfrw import PdfWriter, PdfReader
    #import pdfrw_bookmarks # code from https://github.com/pmaupin/pdfrw/issues/52#issuecomment-271190546
    pages = PdfReader(sourcepdf).pages
    totalpages = len(pages)
    outdata = PdfWriter(destpdf)
    for p in pageslist:
        if p < totalpages:
            if debug: print("pdfrw ", p)
            #pdfrw_pageannots(pages[p-1])
            outdata.addpage(pages[p - 1])  # p is a 1-based page number
    outdata.write()
#@profile
def createpdf_from_sourcepdf_pages_pypdf(sourcepdf=None, pageslist=None, destpdf=None, debug=False, compress=False):
    """ Generate destpdf with a list of certain pages taken from sourcepdf.
    - <https://pypdf2.readthedocs.io/en/stable/user/merging-pdfs.html>
    - SO [Extract specific pages of PDF and save it with Python](https://stackoverflow.com/a/51885963/710788)
    """
    from pypdf import PdfWriter, PdfReader
    fsource = open(sourcepdf, "rb")
    merger = PdfWriter()
    totalpages = len(PdfReader(fsource).pages)
    for p in pageslist:
        if p < totalpages:
            if debug: print("pypdf ", p)
            # append the single page with 1-based number p (0-based index p-1)
            merger.append(fileobj=fsource, pages=(p - 1, p))
    if compress:  # Compress the data
        for page in merger.pages:
            page.compress_content_streams()  # This is CPU intensive!
    # Write to an output PDF document
    output = open(destpdf, "wb")
    merger.write(output)
    # Close file descriptors
    merger.close()
    output.close()
    fsource.close()
#from memory_profiler import profile
#@profile
def pypdf_vs_pdfrw():
    """ [performance comparative](https://github.com/pmaupin/pdfrw/issues/232#issuecomment-1436153435) between two packages:
    - pdfrw
    - pypdf
    """
    print(datetime.now() - startTime, " before comparing")
    pdfurls = [
        "https://arxiv.org/pdf/2201.00151.pdf",
        "https://arxiv.org/pdf/1707.09725.pdf",
        "https://arxiv.org/pdf/2201.00021.pdf",
        "https://arxiv.org/pdf/2201.00037.pdf",
        "https://arxiv.org/pdf/2201.00069.pdf",
        "https://arxiv.org/pdf/2201.00178.pdf",
        "https://arxiv.org/pdf/2201.00201.pdf",
        "https://arxiv.org/pdf/1602.06541.pdf",
        "https://arxiv.org/pdf/2201.00200.pdf",
        "https://arxiv.org/pdf/2201.00022.pdf",
        "https://arxiv.org/pdf/2201.00029.pdf",
        "https://arxiv.org/pdf/1601.03642.pdf",
    ]
    import requests, os
    pdfrw_Tsize = 0
    pdfrw_Ttime = 0
    pypdf_Tsize = 0
    pypdf_Ttime = 0
    for pdfurl in pdfurls:
        sourcepdf = pdfurl.split("/")[-1]
        if not os.path.exists(sourcepdf):
            response = requests.get(pdfurl, headers=None, params=None)
            if response.status_code == 200:
                with open(sourcepdf, 'wb') as f:
                    f.write(response.content)
            else:
                print(response.status_code)
                print("COULDN'T DOWNLOAD '{}' FILE:\n".format(pdfurl))
        if not os.path.exists(sourcepdf):
            print("\n", "-_"*40, "\n\nSKIPPING '{}' FILE:\n".format(sourcepdf))
        else:
            print("\n", "-_"*40, "\n\nTESTING WITH '{}' FILE:\n".format(sourcepdf))
            for i in range(1, 6):
                pageslist = [4, 6, 8, 9]*i  # *5 eats all my memory when using pypdf with large pdf files
                print("-"*50, "\npageslist:", pageslist)
                start = datetime.now()
                destpdf = sourcepdf + "_pdfrw-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
                createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf)
                pdfrw_t = round((datetime.now() - start).total_seconds(), 3)
                pdfrw_s = fsize(destpdf)
                pdfrw_Ttime += pdfrw_t
                pdfrw_Tsize += pdfrw_s[0]
                print("pdfrw: {} KB output size, took {} seconds".format(pdfrw_s[1], pdfrw_t))
                start = datetime.now()
                destpdf = sourcepdf + "_pypdf-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
                createpdf_from_sourcepdf_pages_pypdf(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf)
                pypdf_t = round((datetime.now() - start).total_seconds(), 3)
                pypdf_s = fsize(destpdf)
                pypdf_Ttime += pypdf_t
                pypdf_Tsize += pypdf_s[0]
                print("pypdf: {} KB output size, took {} seconds".format(pypdf_s[1], pypdf_t))
                print("pypdf_time / pdfrw_time = {} ratio".format(round(pypdf_t/pdfrw_t, 2)))
                print("pypdf_size / pdfrw_size = {} ratio".format(round(pypdf_s[0]/pdfrw_s[0], 2)))
    import pdfrw, pypdf
    print("-_"*40)
    print("\n pdfrw.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
        pdfrw.__version__, pdfrw_Tsize/1024/1024, pdfrw_Ttime))
    print("\n pypdf.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
        pypdf.__version__, pypdf_Tsize/1024/1024, pypdf_Ttime))

if __name__ == "__main__":
    import sys
    from datetime import datetime
    startTime = datetime.now()
    print("START: ", startTime)
    pypdf_vs_pdfrw()
    endTime = datetime.now()
    print("\nEND: ", endTime)
    print("\nTOTAL TIME: ", endTime - startTime)
OUTPUT:
START: 2023-07-01 22:06:17.718288
0:00:00 before comparing
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
TESTING WITH '2201.00151.pdf' FILE:
--------------------------------------------------
pageslist: [4, 6, 8, 9]
pdfrw: 591.29 KB output size, took 0.109 seconds
pypdf: 660.78 KB output size, took 0.499 seconds
pypdf_time / pdfrw_time = 4.58 ratio
pypdf_size / pdfrw_size = 1.12 ratio
--------------------------------------------------
(... LINES DELETED TO AVOID TOO LONG OUTPUT ...)
--------------------------------------------------
pageslist: [4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 130.20 KB output size, took 0.047 seconds
pypdf: 836.60 KB output size, took 1.031 seconds
pypdf_time / pdfrw_time = 21.94 ratio
pypdf_size / pdfrw_size = 6.43 ratio
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.47 seconds
pypdf.__version__ 3.2.0
Accumulated output file size: 193.77 MB
Total time: 108.14 seconds
END: 2023-07-01 22:08:11.767827
TOTAL TIME: 0:01:54.049539
Thank you for sharing!
I just updated my benchmark.py code and made a run with the latest libraries ... and I realized that it's super hard to understand :see_no_evil: I need to simplify it to make it easier to extend.
I've just noticed that you should definitely apply compression when merging stuff with pypdf. Look at the diff of the README: https://github.com/py-pdf/benchmarks/commit/a78f609d3d4d0d6a298c72c00ddc05b4d35fce53
@abubelinha Did you use https://pypi.org/project/pdfrw/ or https://pypi.org/project/pdfrw2/?
EDIT: sorry, I misunderstood (thought you were asking about pypdf)
I asked this question and then used this installation:
pip install pdfrw2
So my current output says it's sarnold's pdfrw version 0.5.0. But my original test was done on February 20th, 2023, so I guess that one used pmaupin's pdfrw, probably version 0.4.0 (that script didn't output versions; I incorporated that code yesterday).
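To double-check which distribution is actually installed, a quick sketch (my own, using only the standard library's importlib.metadata, available since Python 3.8):

from importlib.metadata import version, PackageNotFoundError

for dist in ("pdfrw", "pdfrw2"):
    try:
        print(dist, version(dist))  # prints the installed distribution and its version
    except PackageNotFoundError:
        pass  # this distribution is not installed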
Anyway, differences vs pypdf were pretty much the same.
Thank you :pray:
I just added pdfrw to my comparison with watermarking speed/file size: commit / rendered output
It's crazy. I expected it to be faster, but not that much. And I didn't expect a difference in file size.
pdfrw did an awesome job there. Thank you for pointing that out!
What about my pypdf code implementation? Do you see something wrong?
> I've just noticed that you should definitely apply compression when merging stuff with pypdf. Look at the diff of the README: https://github.com/py-pdf/benchmarks/commit/a78f609d3d4d0d6a298c72c00ddc05b4d35fce53
I edited my above code
- I now avoid counting pages multiple times (I moved `len(PdfReader(fsource).pages)` out of the `for` loop).
- I added a boolean `compress` parameter to my pypdf-based function, and copied your compression loop like this:
if compress:  # Compress the data
    for page in merger.pages:
        page.compress_content_streams()  # This is CPU intensive!
Do you see something else to change on my pypdf usage?
Change 1 reduced total pypdf time by a few seconds (still not using compression: compress=False):
pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.59 seconds
pypdf.__version__ 3.2.0
Accumulated output file size: 193.77 MB
Total time: 93.47 seconds
Change 2 didn't work as expected:
Oddly, when using compress=True, my results were even worse.
Not only because of the much longer runtime (which I would expect, since it now runs the compression loop); it also produced a bigger output:
pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.52 seconds
pypdf.__version__ 3.2.0
Accumulated output file size: 203.83 MB
Total time: 237.30 seconds
So I'd say there must be something wrong either in my pypdf code or in the pypdf compression algorithm. Perhaps compression only helps for some kinds of PDF files (e.g. those containing uncompressed content streams or images)? But I used the same PDF files proposed for your benchmark tests, so I am surprised that the "compressed" outputs are bigger than the "uncompressed" ones.
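In case it helps to narrow this down, a minimal sketch (my own, not from the benchmark repo) that measures the effect of compress_content_streams() on a single file, writing to memory to keep the disk out of the picture:

from io import BytesIO
from pypdf import PdfWriter

def output_size(sourcepdf, compress):
    """Return the output size in bytes for a straight copy of sourcepdf."""
    writer = PdfWriter()
    writer.append(sourcepdf)  # copy all pages
    if compress:
        for page in writer.pages:
            page.compress_content_streams()
    buf = BytesIO()
    writer.write(buf)
    return buf.getbuffer().nbytes

# e.g. with one of the benchmark files:
f = "2201.00151.pdf"
print(f, output_size(f, compress=False), output_size(f, compress=True))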
Indeed, total times vary a little between identical script runs (I guess it depends on other non-Python processes running on my computer).
I've just noticed that you used the same files as I did in my benchmark. Nice!