benchmarks
pdfrw vs pypdf page extraction & merge
Test run on Python 3.8, Windows 7:
- I took 4 arbitrary page numbers (4, 6, 8, 9).
- For each of the PDF files listed in the benchmark, I extracted those pages (when available).
- Then I created a new PDF from the extracted pages, repeating them between 1 and 5 times (to check how well pdfrw / pypdf optimize the size of created PDFs containing repetitive information). So output PDFs have up to 4x5 = 20 pages.
- I measured elapsed time and output file sizes.
I recall my initial code also deleted the original bookmarks/annotations from the PDFs, but I removed that part for simplicity and left a comment pointing to where I had read about it.
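For reference, a minimal sketch of what that removal looked like (my reconstruction, assuming pdfrw's PdfDict attribute access, where assigning None drops the key; it is not part of the benchmark below):

from pdfrw import PdfReader

pages = PdfReader("input.pdf").pages
for page in pages:
    # /Annots holds the page's annotation array; assigning None removes the key
    page.Annots = None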
Code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

def fsize(filepath):
    """Return [size in bytes, size in KB as a formatted string]."""
    import os
    finfo = os.stat(filepath)
    nbytes = finfo.st_size
    KB = "%.2f" % (nbytes / 1024)
    return [nbytes, KB]

#@profile
def createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=None, pageslist=None, destpdf=None, debug=False):
    """ <https://github.com/pmaupin/pdfrw/blob/master/examples/subset.py>
    """
    from pdfrw import PdfWriter, PdfReader
    #import pdfrw_bookmarks # code from https://github.com/pmaupin/pdfrw/issues/52#issuecomment-271190546
    pages = PdfReader(sourcepdf).pages
    totalpages = len(pages)
    outdata = PdfWriter(destpdf)
    for p in pageslist:
        if p < totalpages:
            if debug: print("pdfrw ", p)
            #pdfrw_pageannots(pages[p-1])
            outdata.addpage(pages[p - 1])  # p is a 1-based page number
    outdata.write()
#@profile
def createpdf_from_sourcepdf_pages_pypdf(sourcepdf=None, pageslist=None, destpdf=None, debug=False, compress=False):
    """ Generate destpdf with a list of certain pages taken from sourcepdf.
    - <https://pypdf2.readthedocs.io/en/stable/user/merging-pdfs.html>
    - SO [Extract specific pages of PDF and save it with Python](https://stackoverflow.com/a/51885963/710788)
    """
    from pypdf import PdfWriter, PdfReader
    fsource = open(sourcepdf, "rb")
    merger = PdfWriter()
    totalpages = len(PdfReader(fsource).pages)
    for p in pageslist:
        if p < totalpages:
            if debug: print("pypdf ", p)
            # append the single page with 1-based number p (0-based index p-1)
            merger.append(fileobj=fsource, pages=(p - 1, p))
    if compress:  # Compress the data
        for page in merger.pages:
            page.compress_content_streams()  # This is CPU intensive!
    # Write to an output PDF document
    output = open(destpdf, "wb")
    merger.write(output)
    # Close file descriptors
    merger.close()
    output.close()
    fsource.close()
#from memory_profiler import profile
#@profile
def pypdf_vs_pdfrw():
    """ [performance comparative](https://github.com/pmaupin/pdfrw/issues/232#issuecomment-1436153435) between two packages:
    - pdfrw
    - pypdf
    """
    print(datetime.now() - startTime, " before comparing")
    pdfurls = [
        "https://arxiv.org/pdf/2201.00151.pdf",
        "https://arxiv.org/pdf/1707.09725.pdf",
        "https://arxiv.org/pdf/2201.00021.pdf",
        "https://arxiv.org/pdf/2201.00037.pdf",
        "https://arxiv.org/pdf/2201.00069.pdf",
        "https://arxiv.org/pdf/2201.00178.pdf",
        "https://arxiv.org/pdf/2201.00201.pdf",
        "https://arxiv.org/pdf/1602.06541.pdf",
        "https://arxiv.org/pdf/2201.00200.pdf",
        "https://arxiv.org/pdf/2201.00022.pdf",
        "https://arxiv.org/pdf/2201.00029.pdf",
        "https://arxiv.org/pdf/1601.03642.pdf",
    ]
    import requests, os
    pdfrw_Tsize = 0
    pdfrw_Ttime = 0
    pypdf_Tsize = 0
    pypdf_Ttime = 0
    for pdfurl in pdfurls:
        sourcepdf = pdfurl.split("/")[-1]
        if not os.path.exists(sourcepdf):
            response = requests.get(pdfurl, headers=None, params=None)
            if response.status_code == 200:
                with open(sourcepdf, 'wb') as f:
                    f.write(response.content)
            else:
                print(response.status_code)
                print("COULDN'T DOWNLOAD '{}' FILE:\n".format(pdfurl))
        if not os.path.exists(sourcepdf):
            print("\n", "-_"*40, "\n\nSKIPPING '{}' FILE:\n".format(sourcepdf))
        else:
            print("\n", "-_"*40, "\n\nTESTING WITH '{}' FILE:\n".format(sourcepdf))
            for i in range(1, 6):
                pageslist = [4, 6, 8, 9]*i  # *5 eats all my memory when using pypdf with large pdf files
                print("-"*50, "\npageslist:", pageslist)
                start = datetime.now()
                destpdf = sourcepdf + "_pdfrw-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
                createpdf_from_sourcepdf_pages_pdfrw(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf)
                pdfrw_t = round((datetime.now() - start).total_seconds(), 3)
                pdfrw_s = fsize(destpdf)
                pdfrw_Ttime += pdfrw_t
                pdfrw_Tsize += pdfrw_s[0]
                print("pdfrw: {} KB output size, took {} seconds".format(pdfrw_s[1], pdfrw_t))
                start = datetime.now()
                destpdf = sourcepdf + "_pypdf-test_{}.pdf".format(".".join([str(p) for p in pageslist]))
                createpdf_from_sourcepdf_pages_pypdf(sourcepdf=sourcepdf, pageslist=pageslist, destpdf=destpdf)
                pypdf_t = round((datetime.now() - start).total_seconds(), 3)
                pypdf_s = fsize(destpdf)
                pypdf_Ttime += pypdf_t
                pypdf_Tsize += pypdf_s[0]
                print("pypdf: {} KB output size, took {} seconds".format(pypdf_s[1], pypdf_t))
                print("pypdf_time / pdfrw_time = {} ratio".format(round(pypdf_t/pdfrw_t, 2)))
                print("pypdf_size / pdfrw_size = {} ratio".format(round(pypdf_s[0]/pdfrw_s[0], 2)))
    import pdfrw, pypdf
    print("-_"*40)
    print("\n pdfrw.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
        pdfrw.__version__, pdfrw_Tsize/1024/1024, pdfrw_Ttime))
    print("\n pypdf.__version__ {}\nAccumulated output file size: {:.2f} MB\nTotal time: {:.2f} seconds".format(
        pypdf.__version__, pypdf_Tsize/1024/1024, pypdf_Ttime))

if __name__ == "__main__":
    import sys
    from datetime import datetime
    startTime = datetime.now()
    print("START: ", startTime)
    pypdf_vs_pdfrw()
    endTime = datetime.now()
    print("\nEND: ", endTime)
    print("\nTOTAL TIME: ", endTime - startTime)
OUTPUT:
START: 2023-07-01 22:06:17.718288
0:00:00 before comparing
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
TESTING WITH '2201.00151.pdf' FILE:
--------------------------------------------------
pageslist: [4, 6, 8, 9]
pdfrw: 591.29 KB output size, took 0.109 seconds
pypdf: 660.78 KB output size, took 0.499 seconds
pypdf_time / pdfrw_time = 4.58 ratio
pypdf_size / pdfrw_size = 1.12 ratio
--------------------------------------------------
(... LINES DELETED TO AVOID TOO LONG OUTPUT ...)
--------------------------------------------------
pageslist: [4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 130.20 KB output size, took 0.047 seconds
pypdf: 836.60 KB output size, took 1.031 seconds
pypdf_time / pdfrw_time = 21.94 ratio
pypdf_size / pdfrw_size = 6.43 ratio
-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_
pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.47 seconds
pypdf.__version__ 3.2.0
Accumulated output file size: 193.77 MB
Total time: 108.14 seconds
END: 2023-07-01 22:08:11.767827
TOTAL TIME: 0:01:54.049539
Thank you for sharing!
I just updated my benchmark.py code and made a run with the latest libraries ... and I realized that it's super hard to understand :see_no_evil: I need to simplify it to make it easier to extend.
I've just noticed that you should definitely apply compression when merging stuff with pypdf. Look at the diff of the README: https://github.com/py-pdf/benchmarks/commit/a78f609d3d4d0d6a298c72c00ddc05b4d35fce53
@abubelinha Did you use https://pypi.org/project/pdfrw/ or https://pypi.org/project/pdfrw2/?
EDIT: sorry, I misunderstood (thought you were asking about pypdf)
I asked this question and then used this installation:
pip install pdfrw2
So my current output says it's sarnold's pdfrw version 0.5.0. But my original test was done on February 20th, 2023, so I guess that one used pmaupin's pdfrw, probably version 0.4.0 (that script didn't output versions; I incorporated that code yesterday).
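To double-check which distribution is actually installed, a quick sketch (my own, using only the standard library's importlib.metadata, available since Python 3.8):

from importlib.metadata import version, PackageNotFoundError

for dist in ("pdfrw", "pdfrw2"):
    try:
        print(dist, version(dist))  # prints the installed distribution and its version
    except PackageNotFoundError:
        pass  # this distribution is not installed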
Anyway, differences vs pypdf were pretty much the same.
Thank you :pray:
I just added pdfrw to my comparison with watermarking speed/file size: commit / rendered output
It's crazy. I expected it to be faster, but not that much. And I didn't expect a difference in file size.
pdfrw did an awesome job there. Thank you for pointing that out!
What about my pypdf code implementation? Do you see something wrong?
> I've just noticed that you should definitely apply compression when merging stuff with pypdf. Look at the diff of the README: https://github.com/py-pdf/benchmarks/commit/a78f609d3d4d0d6a298c72c00ddc05b4d35fce53
I edited my above code
- I now avoid counting pages multiple times (I moved `len(PdfReader(fsource).pages)` out of the `for` loop).
- I added a boolean `compress` parameter to my pypdf-based function, and copied your compression loop like this:
if compress:  # Compress the data
    for page in merger.pages:
        page.compress_content_streams()  # This is CPU intensive!
Do you see something else to change on my pypdf usage?
Change 1 reduced total pypdf time by a few seconds (still not using compression: compress=False):
pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.59 seconds
pypdf.__version__ 3.2.0
Accumulated output file size: 193.77 MB
Total time: 93.47 seconds
Change 2 didn't work as expected:
Oddly, when using compress=True, my results were even worse.
Not only because of the much longer runtime (which I would expect, since it now runs the compression loop); it also produced a bigger output:
pdfrw.__version__ 0.5.0
Accumulated output file size: 50.73 MB
Total time: 4.52 seconds
pypdf.__version__ 3.2.0
Accumulated output file size: 203.83 MB
Total time: 237.30 seconds
So I'd say there must be something wrong either in my pypdf code or in the pypdf compression algorithm. Perhaps compression only helps for some kinds of PDF files (e.g. those containing uncompressed content streams or images)? But I used the same PDF files proposed for your benchmark tests, so I am surprised that the "compressed" outputs are bigger than the "uncompressed" ones.
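In case it helps to narrow this down, a minimal sketch (my own, not from the benchmark repo) that measures the effect of compress_content_streams() on a single file, writing to memory to keep the disk out of the picture:

from io import BytesIO
from pypdf import PdfWriter

def output_size(sourcepdf, compress):
    """Return the output size in bytes for a straight copy of sourcepdf."""
    writer = PdfWriter()
    writer.append(sourcepdf)  # copy all pages
    if compress:
        for page in writer.pages:
            page.compress_content_streams()
    buf = BytesIO()
    writer.write(buf)
    return buf.getbuffer().nbytes

# e.g. with one of the benchmark files:
f = "2201.00151.pdf"
print(f, output_size(f, compress=False), output_size(f, compress=True))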
Indeed, total times vary a little between identical script runs (I guess it depends on other non-Python processes running on my computer).
I've just noticed that you used the same files as I did in my benchmark. Nice!