
PDF/A confirmation broken after splitting and creating new pdf documents

Open stojo opened this issue 1 year ago • 16 comments

Since some version of PyPDF2, the PDF documents that I split and regenerate are losing PDF/A conformance (checked with https://avepdf.com/pdfa-validation). Those documents are rejected by certain applications that check documents for PDF/A (e.g. DocuSign). It works fine with earlier versions such as PyPDF2 1.28.4.

Maybe helpful (?): Files split with the newest version of PyPDF2 are about 10 kB smaller than files generated with earlier versions.

Environment

Windows-10-10.0.19042-SP0 PyPDF2==2.10.8

Code (PDFs containing confidential content and therefore not sharable)

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfFileWriter, PdfFileReader

def read_and_split_document():
    # initialize pdf reader
    try:
        print(">> Reading document....")
        inputpdf = PdfFileReader(open("conventions.pdf", "rb"))
        outlines = inputpdf.getOutlines() 
        sites = inputpdf.numPages
        # ...
        # some more code with different operations following (not relevant for this issue)
        # ...
        output = PdfFileWriter()
        for j in range(start,end-1):
                output.addPage(inputpdf.getPage(j))
                with open("documents/"+page_list[i].get("name")+".pdf", "wb") as outputStream:
                    output.write(outputStream)



stojo avatar Sep 14 '22 11:09 stojo

@stojo From your code I understand that you are creating multiple files from one PDF, and that the first file will contain only one page. Although you may not be able to share this one-page document, can you at least provide the validation report from the website you mentioned?

pubpub-zz avatar Sep 16 '22 15:09 pubpub-zz

@pubpub-zz Yes, I am creating multiple files from one big file. But the first file does not contain only one page. The new files always contain at least 4 pages.

Here is the output from the website: [image]

And here is the full XML error report: [image]

stojo avatar Sep 22 '22 13:09 stojo

@stojo Do you have any PDF/A compliant document you can share? Can you adjust the example code in such a way that it is minimal and complete (e.g. has all imports and not half of a try-except block)?

MartinThoma avatar Sep 24 '22 03:09 MartinThoma

@stojo Can you recheck with the latest version?

pubpub-zz avatar Feb 05 '23 21:02 pubpub-zz

@stojo +1?

pubpub-zz avatar Feb 26 '23 11:02 pubpub-zz

I have also had this problem for a long time and now checked it again with version 3.5.1: The PDF version is now correctly declared as 1.7 (with older PyPDF2 versions it became 1.3). But unfortunately it still does not pass the check on https://avepdf.com/de/pdfa-validation.


<?xml version="1.0" encoding="UTF-8"?>
<ValidationReport>
    <VersionInformation ID="GdPicture.NET.14" Version="14.2.19" />
    <ValidationProfile Conformance="PDF/A" Part="1" Level="A" />
    <FileInfo FileName="2023-03-05  TEST1_2.pdf" FileSize="10822 bytes" />
    <ValidationResult IsCompliant="False" Statement="PDF file is not compliant with validation profile requirements." />
    <Details>
        <FailedChecks Count="8">
            <Check ID="MissingXMPMetadata" OccurenceCount="1">
                <Occurence Context="Document" Statement="Document XMP metadata is missing." ObjReference="None" />
            </Check>
            <Check ID="MissingMarkInfoDictionary" OccurenceCount="1">
                <Occurence Context="Document" Statement="MarkInfo dictionary is missing." ObjReference="None" />
            </Check>
            <Check ID="MissingStructTreeRootDictionary" OccurenceCount="1">
                <Occurence Context="Document" Statement="StructTreeRoot dictionary not found." ObjReference="None" />
            </Check>
            <Check ID="FileStructureMissingTrailerIDEntry" OccurenceCount="1">
                <Occurence Context="Document" Statement="The file trailer is missing the ID array entry." ObjReference="None" />
            </Check>
            <Check ID="NoCidDSetEntry" OccurenceCount="4">
                <Occurence Context="Page" PageNumber="2" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="19 0 obj" />
                <Occurence Context="Page" PageNumber="2" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="11 0 obj" />
                <Occurence Context="Page" PageNumber="3" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="19 0 obj" />
                <Occurence Context="Page" PageNumber="3" Statement="The CIDFont subset font has no CIDSet entry." ObjReference="11 0 obj" />
            </Check>
        </FailedChecks>
    </Details>
</ValidationReport>
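
As a reading aid, a report like the one above can be condensed into a per-check count. This is only a minimal sketch using the Python standard library: the function name `summarize_failed_checks` and the trimmed `sample` string are made up for illustration, and the element names (including the validator's own spelling `Occurence`) are taken from the report as posted.

```python
import xml.etree.ElementTree as ET


def summarize_failed_checks(report_xml: str) -> dict[str, int]:
    """Map each failed check ID to the number of occurrences listed for it."""
    root = ET.fromstring(report_xml)
    summary = {}
    for check in root.iter("Check"):
        # Count the child elements rather than trusting the OccurenceCount attribute.
        # "Occurence" is spelled the way the validator writes it.
        summary[check.get("ID")] = len(check.findall("Occurence"))
    return summary


# Hypothetical, shortened report in the same shape as the one posted above.
sample = """
<ValidationReport>
    <Details>
        <FailedChecks Count="3">
            <Check ID="MissingXMPMetadata" OccurenceCount="1">
                <Occurence Context="Document" Statement="Document XMP metadata is missing." />
            </Check>
            <Check ID="NoCidDSetEntry" OccurenceCount="2">
                <Occurence Context="Page" PageNumber="2" Statement="No CIDSet entry." />
                <Occurence Context="Page" PageNumber="3" Statement="No CIDSet entry." />
            </Check>
        </FailedChecks>
    </Details>
</ValidationReport>
"""

for check_id, count in summarize_failed_checks(sample).items():
    print(f"{check_id}: {count}")
```

In the report above, the four document-level checks (XMP metadata, MarkInfo, StructTreeRoot, trailer ID) each occur once, while the CIDSet check accounts for the remaining four occurrences.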

geimist avatar Mar 07 '23 12:03 geimist

thank you for sharing this @geimist :heart: I haven't read "14.7 Logical Structure" before.

Here are a few documents that have it:

  • pypdf/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf
  • pypdf/resources/git.pdf
  • pypdf/resources/issue-604.pdf
  • pypdf/resources/issue-914-xmp-data.pdf
  • pypdf/tests/pdf_cache/book_471.pdf
  • pypdf/tests/pdf_cache/BreezeMan1.pdf
  • pypdf/tests/pdf_cache/BreezeMan2.pdf
  • pypdf/tests/pdf_cache/budgeting-loan-form-sf500.pdf
  • pypdf/tests/pdf_cache/GeoBaseWithComments.pdf
  • pypdf/tests/pdf_cache/Giacalone.pdf
  • pypdf/tests/pdf_cache/iss_1134.pdf
  • pypdf/tests/pdf_cache/iss1689.pdf
  • pypdf/tests/pdf_cache/issue_416.pdf
  • pypdf/tests/pdf_cache/PDF32000_2008.pdf
  • pypdf/tests/pdf_cache/pypdf-5536984.pdf
  • pypdf/tests/pdf_cache/st2019.pdf
  • pypdf/tests/pdf_cache/test_write_outline_item_on_page_fitv.pdf
  • pypdf/tests/pdf_cache/tika-906769.pdf
  • pypdf/tests/pdf_cache/tika-911260.pdf
  • pypdf/tests/pdf_cache/tika-914568.pdf
  • pypdf/tests/pdf_cache/tika-918137.pdf
  • pypdf/tests/pdf_cache/tika-923621.pdf
  • pypdf/tests/pdf_cache/tika-934771.pdf
  • pypdf/tests/pdf_cache/tika-935981.pdf
  • pypdf/tests/pdf_cache/tika-941536.pdf
  • pypdf/tests/pdf_cache/tika-942050.pdf
  • pypdf/tests/pdf_cache/tika-953770.pdf
  • pypdf/tests/pdf_cache/tika-959173.pdf
  • pypdf/tests/pdf_cache/tika-959519.pdf
  • pypdf/tests/pdf_cache/tika-972174.pdf
  • pypdf/tests/pdf_cache/tika-972962.pdf
  • pypdf/tests/pdf_cache/tika-980613.pdf
  • pypdf/tests/pdf_cache/tika-988698.pdf
  • pypdf/tests/pdf_cache/tika-992472.pdf
  • pypdf/tests/pdf_cache/tst_iss1631.pdf

And some more:

The MarkInfo almost always just contains {'/Marked': True}, sometimes also '/LetterspaceFlags': 0
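
To compare an original file with a regenerated one, it may help to list which of the document-level entries named in the validation report are absent. The sketch below is an assumption-laden illustration, not pypdf code: `missing_document_entries` is a hypothetical helper that accepts any two mappings, and plain dicts stand in for the catalog and trailer so that it runs without a PDF.

```python
from typing import Mapping

# Catalog keys behind the document-level failures in the report:
# XMP metadata, MarkInfo and StructTreeRoot all live in the document catalog.
CATALOG_KEYS = ("/Metadata", "/MarkInfo", "/StructTreeRoot")


def missing_document_entries(catalog: Mapping, trailer: Mapping) -> list[str]:
    """Return the document-level keys that are absent from catalog and trailer."""
    missing = [key for key in CATALOG_KEYS if key not in catalog]
    # The file identifier array is stored in the trailer, not in the catalog.
    if "/ID" not in trailer:
        missing.append("/ID")
    return missing


# Plain dicts as stand-ins for real PDF objects (illustrative values only).
original = {
    "/Type": "/Catalog",
    "/Metadata": object(),
    "/MarkInfo": {"/Marked": True},
    "/StructTreeRoot": {},
}
rebuilt = {"/Type": "/Catalog"}

print(missing_document_entries(original, {"/ID": ["a", "b"]}))
print(missing_document_entries(rebuilt, {}))
```

With pypdf, `reader.trailer` and `reader.trailer["/Root"]` should be usable as the two mappings, though that has not been checked against the files in this thread.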

MartinThoma avatar Mar 12 '23 09:03 MartinThoma

@pubpub-zz I only had a quick glance at "14.7 Logical Structure" so far, but this sounds interesting:

(Optional; PDF 1.4) Text that is an exact replacement for the structure element and its children. This replacement text (which should apply to as small a piece of content as possible) is useful when extracting the document’s contents in support of accessibility to users with disabilities or for other purposes (see 14.9.4, “Replacement Text”).

It sounds as if this might improve the text extraction in some cases a lot.

MartinThoma avatar Mar 12 '23 09:03 MartinThoma

@geimist / @MartinThoma Remember that PDF/A requires some information that is stored not within the pages but in document-level (global) structures. That information is not linked from the pages, so it cannot be copied along with them. Can you try clone_document_from_reader() and check whether the results are better?

pubpub-zz avatar Mar 12 '23 09:03 pubpub-zz

@pubpub-zz I only had a quick glance at "14.7 Logical Structure" so far, but this sounds interesting: (...) It sounds as if this might improve the text extraction in some cases a lot.

Thanks for the tip. So far I have not been able to find out how to use or extract this "replacement text".

pubpub-zz avatar Mar 12 '23 09:03 pubpub-zz

Hi, I'm working on the same project as @geimist. I'm not sure if I understood it correctly, but I tried this:

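from PyPDF2 import PdfReader, PdfWriter

def splitt_pdf(pdf_file_name: str, pages, new_name):
    # pages is not used here: this version only clones the whole document
    pdf_file_path = pdf_file_name
    file_base_name = pdf_file_path.replace('.pdf', '')
    pdf = PdfReader(pdf_file_path)
    pdf_Writer = PdfWriter()
    # copy the complete document from the reader into the writer
    pdf_Writer.clone_document_from_reader(pdf)
    file_out = f"{file_base_name}_{new_name}.pdf"
    # the with block closes the file, so no explicit f.close() is needed
    with open(file_out, 'wb') as f:
        pdf_Writer.write(f)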

So there is no change to the document, just a clone from the reader. The original document passes the validation, the cloned one does not.

Gthorsten65 avatar Mar 12 '23 20:03 Gthorsten65

@Gthorsten65 Can you provide the original and the output file, please?

pubpub-zz avatar Mar 12 '23 20:03 pubpub-zz

Yes and no :-) I will do the same with a document that has no personal data in it. Then I will give you the files. Can I upload them here, or how should I do this?

Gthorsten65 avatar Mar 12 '23 20:03 Gthorsten65

OK, here they are: Test_spiegel_A is the one that passes the test, test_spiegel_A_even fails. The second one was produced with the code above. Attachments: Test_spiegel_A.pdf, test_spiegel_A_even.pdf

Gthorsten65 avatar Mar 12 '23 20:03 Gthorsten65

Sorry, forget my comments. It is working. The problem on my side was that I was using PyPDF2 :-( With pypdf it works.

Gthorsten65 avatar Mar 12 '23 20:03 Gthorsten65

Hmm, OK, sorry: now I tested it with what we actually want to do, and the validation error comes back.

# assumed import: the previous comment switched from PyPDF2 to pypdf
from pypdf import PdfReader, PdfWriter

def splitt_pdf(pdf_file_name: str, pages, new_name):
    pdf_file_path = pdf_file_name
    file_base_name = pdf_file_path.replace('.pdf', '')
    pdf = PdfReader(pdf_file_path)
    # pages = [1, 3, 5]  # page 1, 3, 5
    pdf_Writer = PdfWriter()
    pdf_Writer.clone_reader_document_root(pdf)
    #pdf_Writer.clone_document_from_reader(pdf)
    # pages holds 1-based page numbers, pdf.pages is 0-based
    for page_num in pages:
        pdf_Writer.add_page(pdf.pages[page_num-1])
    file_out = f"{file_base_name}_{new_name}.pdf"
    # the with block closes the file, so no explicit f.close() is needed
    with open(file_out, 'wb') as f:
        pdf_Writer.write(f)

If I just use clone_document_from_reader and then write it to disk, the document passes the test. But if I use clone_reader_document_root, then add the pages I need with pdf_Writer.add_page() and write the result to a file, the check fails.

Even with clone_document_from_reader followed by adding pages (which, as I understand it, is not correct anyway, because I only want some of the pages), the test fails.

So currently the only way that works is clone_document_from_reader alone. But then I have too many pages, because I want to split one document into two documents.

So am I misunderstanding something, or what is going wrong on my side?
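
One direction that might be worth exploring for the "too many pages" case is to clone the whole document and then remove the pages that are not wanted, instead of adding pages to an empty writer. Whether the installed pypdf version offers a page-removal call, and whether the result stays PDF/A compliant, is not confirmed anywhere in this thread. The sketch below therefore only covers the index bookkeeping: `pages_to_drop` is a hypothetical helper that turns the 1-based pages to keep into the 0-based indices to remove, highest first, so that earlier removals do not shift later ones.

```python
def pages_to_drop(total_pages: int, keep: list[int]) -> list[int]:
    """Zero-based indices to remove, highest first, given 1-based pages to keep."""
    keep_set = set(keep)
    out_of_range = [p for p in keep_set if not 1 <= p <= total_pages]
    if out_of_range:
        raise ValueError(f"pages out of range: {sorted(out_of_range)}")
    # Walk backwards so that removing one index never renumbers a later one.
    return [i for i in range(total_pages - 1, -1, -1) if i + 1 not in keep_set]


# Keep pages 1, 3 and 5 of a six-page document.
print(pages_to_drop(6, [1, 3, 5]))
```

The returned list could then be fed, in order, to whatever page-removal mechanism is available on the cloned writer.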

Gthorsten65 avatar Mar 12 '23 21:03 Gthorsten65