PDFsharp Error when accessing PdfDocument.AcroForm.Fields

I tried to load the attached file and iterate over pdf.AcroForm.Fields. At first I encountered an error:

No appropriate constructor found for type: PdfAcroFieldCollection at PdfSharp.Pdf.PdfDictionary.DictionaryElements.CreateArray(Type type, PdfArray oldArray)

So I added the constructor it was looking for: public PdfAcroFieldCollection(PdfDocument document): base(document){ }

That error was "solved", but the next one appeared: 'Object already in table.' at PdfSharp.Pdf.Advanced.PdfCrossReferenceTable.Add(PdfObject value)

I tried changing Add to ObjectTable[value.ObjectID]=value.ReferenceNotNull;. After that there was no exception, but pdf.AcroForm.Fields was empty.

21a352e0-dd03-4855-a1e5-82fb3690493c.pdf

Steps to reproduce:

dotnet new console -n PdfSharpBug
cd PdfSharpBug
dotnet add package PdfSharp
code .

using System.Net;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

using (var ms = new MemoryStream(new WebClient().DownloadData("https://github.com/user-attachments/files/17873789/21a352e0-dd03-4855-a1e5-82fb3690493c.pdf")))
using (var pdf = PdfReader.Open(ms, PdfDocumentOpenMode.Import))
{
    foreach (PdfReference fieldReference in pdf.AcroForm.Fields)
    {
        Console.WriteLine(fieldReference.ToString());
    }
}

dotnet run

Observed result:

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at PdfSharp.Pdf.PdfDictionary.DictionaryElements.CreateArray(Type type, PdfArray oldArray)
   at PdfSharp.Pdf.PdfDictionary.DictionaryElements.GetValue(String key, VCF options)
   at PdfSharp.Pdf.AcroForms.PdfAcroForm.get_Fields()
   at Program.<Main>$(String[] args) in C:\git\PdfSharpBug\Program.cs:line 8

Nov 22 '24 17:11 eZprava

Please consider using the Issue Submission Template: https://docs.pdfsharp.net/General/Issue-Reporting.html

Nov 25 '24 07:11 ThomasHoevel

I added the steps needed to reproduce the NullReferenceException. I hope it helps.

Nov 25 '24 10:11 eZprava

When I added the attached document to my test files and ran my usual tests on it, I noticed this message in PDFsharp's log output:

Error [0]: Object '43 0' already exists in xref table’s references, referring to position 160538. The latter one referring to position 159770 is used. This should not occur. If somebody came here, please send us your PDF file so that we can fix it (issues (at) pdfsharp.net.

So there you have your document, as requested! 😉

I did some digging and found that object 43 0 is indeed defined twice in the document:

- as a PdfAcroForm that was added by an incremental update
- as an XRef-stream in the previous version of the document (before the update)

As the file is read from back to front, the AcroForm reference is read first. When the XRef-stream is read, the library detects an existing reference with this ID and (falsely) assumes that the existing one must also be an XRef-stream. But in this case the existing reference is an AcroForm! The library nonetheless overwrites the file position of the AcroForm with the position of the XRef-stream. When the AcroForm reference is finally resolved, the XRef-stream is read instead, resulting in an AcroForm with no fields.
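
To make the mix-up concrete, here is a minimal toy model of it. It is only a sketch and does not use any PDFsharp types: the dictionaries, the tuple names and the two "policies" are all made up for illustration. The two positions are taken from the log message above, and I am assuming the higher one (160538) belongs to the AcroForm from the incremental update, since that one is read first. Whether "first read wins" is the right fix for PDFsharp itself is also an assumption.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: none of these names are PDFsharp APIs.
// Key: (object number, generation); value: (file position, what is stored there).
var overwriting = new Dictionary<(int Num, int Gen), (long Pos, string Kind)>();
var firstWins = new Dictionary<(int Num, int Gen), (long Pos, string Kind)>();

// Entries in the order a back-to-front reader meets them:
// the incremental update first, then the older revision.
// Positions come from the log message; the Kind labels are an assumption.
var entries = new[]
{
    (Id: (Num: 43, Gen: 0), Pos: 160538L, Kind: "AcroForm"),
    (Id: (Num: 43, Gen: 0), Pos: 159770L, Kind: "XRef stream"),
};

foreach (var e in entries)
{
    // Policy A: a duplicate that is read later replaces the stored position.
    overwriting[e.Id] = (e.Pos, e.Kind);

    // Policy B: the entry that was read first (the newest one) is kept.
    firstWins.TryAdd(e.Id, (e.Pos, e.Kind));
}

var a = overwriting[(43, 0)];
var b = firstWins[(43, 0)];
Console.WriteLine($"overwrite:  43 0 -> {a.Kind} at {a.Pos}");
Console.WriteLine($"first wins: 43 0 -> {b.Kind} at {b.Pos}");
```

With policy A, resolving 43 0 lands on the XRef stream at 159770, which matches the "AcroForm with no fields" symptom. With policy B it lands on the AcroForm at 160538.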

I would say this issue is two-fold:

- the software that applied the incremental update assigned an incorrect object ID
- PDFsharp makes a wrong assumption when it encounters the duplicate

Dec 15 '24 13:12 packdat