Any way to detect embedded PDF web capture content?

Open: borwickatuw opened this issue 4 years ago • 1 comment

Hello! While testing pdfrw against the "Archivist's PDF Cabinet of Horrors" corpus, I've found two cases where pdfrw seems to silently drop content when concatenating PDFs:

Case 1: PDF portfolio. This is pretty easy to catch by checking whether '/Collection' in reader['/Root'] is true.

Case 2: Web capture content. I haven't figured out how to detect this. To see the issue in practice, download https://github.com/openpreserve/format-corpus/blob/master/pdfCabinetOfHorrors/webCapture.pdf and then run:

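    from pdfrw import PdfReader, PdfWriter

    # pdf_paths (input files) and output (destination path) are defined elsewhere
    pdf_merger = PdfWriter()
    page_count = 0  # total pages seen across all sources
    for source in pdf_paths:
        reader = PdfReader(source)
        page_count += len(reader.pages)
        pdf_merger.addpages(reader.pages)
    pdf_merger.write(output)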

webCapture.pdf (the source) somehow has two extra pages in it that do not make it into the merged output; I am guessing these are the web capture content.

I don't necessarily want pdfrw to render these pages, but I would like to figure out how to identify whether a PDF has web capture content so I can raise an exception. This is where I run into my lack of knowledge about the PDF format. I scanned through the PDF 1.7 standard for possible keywords; both of these produce matches with grep:

    grep -i SpiderInfo webCapture.pdf
    grep -i URLS webCapture.pdf

I don't know where pdfrw would allow me to access whatever says SpiderInfo in the PdfReader object though. Is there a method/data structure I could access on a PdfReader object to find this SpiderInfo keyword?

borwickatuw, Oct 14 '20 17:10

This is probably the worst possible way to do this but it looks like '/SpiderInfo ' in reader.source.fdata works. :-(

borwickatuw, Oct 14 '20 17:10
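
A possibly less brittle alternative to scanning the raw file bytes is to look at the document catalog. The PDF 1.7 reference lists SpiderInfo as an optional catalog entry and IDS and URLS as web capture name trees inside the catalog's Names dictionary, and in pdfrw the catalog is reader.Root. The sketch below rests on those assumptions and has not been checked against webCapture.pdf. The names describe_unmerged_features and check_pdf are hypothetical helpers, not part of pdfrw, and the lookup assumes that pdfrw's PdfDict.get accepts plain '/Name' strings as keys.

```python
def describe_unmerged_features(root):
    """Return labels for catalog features a page-by-page merge would not carry over.

    root is the document catalog: reader.Root in pdfrw, or any mapping
    keyed by PDF names such as '/SpiderInfo'.
    """
    found = []

    # Case 1: a PDF portfolio is marked by a /Collection entry in the catalog.
    if root.get('/Collection') is not None:
        found.append('portfolio')

    # Case 2: web capture state lives in /SpiderInfo; captured content is
    # indexed by the /IDS and /URLS name trees under /Names.
    names = root.get('/Names')
    has_capture_trees = names is not None and (
        names.get('/IDS') is not None or names.get('/URLS') is not None
    )
    if root.get('/SpiderInfo') is not None or has_capture_trees:
        found.append('web capture')

    return found


def check_pdf(path):
    """Open path with pdfrw and raise ValueError if it has such features."""
    # Imported here so the catalog check above can be used without pdfrw.
    from pdfrw import PdfReader

    reader = PdfReader(path)
    found = describe_unmerged_features(reader.Root)
    if found:
        raise ValueError('%s contains: %s' % (path, ', '.join(found)))
    return reader
```

In the merge loop above, calling check_pdf(source) in place of PdfReader(source) would stop before any pages are added. Keeping the catalog check separate from the file handling also means it can be tried on plain dictionaries.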