pdfrw
Any way to detect embedded PDF web capture content?
Hello! In reviewing the "Archivist's PDF Cabinet of Horrors" vs pdfrw, I've found two cases where pdfrw seems to silently drop content when concatenating PDFs:
Case 1: PDF portfolio. This is pretty easy to catch by checking if '/Collection' in reader['/Root']
Case 2: Web capture content. I haven't figured out how to detect this. To see the issue in practice, download https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/webCapture.pdf and then run:
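from pdfrw import PdfReader, PdfWriter

# pdf_paths: list of input PDF paths (including webCapture.pdf); output: destination path
pdf_merger = PdfWriter()
page_count = 0
for source in pdf_paths:
    reader = PdfReader(source)
    page_count += len(reader.pages)
    pdf_merger.addpages(reader.pages)
pdf_merger.write(output)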
webCapture.pdf (the source file) somehow has two extra pages in it, which I am guessing are the web capture content.
I don't necessarily want pdfrw to render these pages, but I would like to figure out how to identify if a PDF has web capture content so I can raise an exception. This is where I run into my lack of knowledge about the PDF format. I scanned through the PDF 1.7 standard for possible keywords; these both come up true from grep:
grep -i SpiderInfo webCapture.pdf
grep -i URLS webCapture.pdf
I don't know where pdfrw would allow me to access whatever says SpiderInfo in the PdfReader object, though. Is there a method or data structure on a PdfReader object that I could use to find this SpiderInfo keyword?
This is probably the worst possible way to do this, but it looks like '/SpiderInfo ' in reader.source.fdata works. :-(
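A possibly cleaner alternative to searching the raw file data, sketched here under some assumptions: in the PDF 1.7 standard, SpiderInfo is an optional entry in the document catalog, and the URLS and IDS name trees live in the catalog's Names dictionary. If webCapture.pdf follows that layout (I have not confirmed this against the file itself), the same dictionary lookup used for '/Collection' should find them. The function names `has_web_capture` and `assert_no_web_capture` are made up for illustration and are not part of pdfrw.

```python
def has_web_capture(root):
    """Return True if a document catalog shows signs of Web Capture data.

    `root` is the catalog dictionary, e.g. reader['/Root'] from pdfrw.
    Only dictionary-style .get() lookups are used.
    """
    if root is None:
        return False
    # /SpiderInfo is the catalog entry for the Web Capture information dictionary.
    if root.get('/SpiderInfo') is not None:
        return True
    # Captured content is indexed by the /URLS and /IDS name trees,
    # which sit in the catalog's /Names dictionary.
    names = root.get('/Names')
    if names is None:
        return False
    return names.get('/URLS') is not None or names.get('/IDS') is not None


def assert_no_web_capture(path):
    """Hypothetical guard: raise if the PDF at `path` looks like a web capture."""
    # Imported here so has_web_capture() can be tried without pdfrw installed.
    from pdfrw import PdfReader

    reader = PdfReader(path)
    if has_web_capture(reader['/Root']):
        raise ValueError('%s contains web capture content' % path)
```

Calling `assert_no_web_capture(source)` inside the merge loop, before `addpages`, would turn the silent drop into an exception. Checking both SpiderInfo and the name trees is deliberately conservative, since either one alone suggests the file was produced by Web Capture.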