Consolidate document / page modification approach
There has been some work recently on splitting / merging PDF documents (see #248, #254, #262) using the PdfMerger class, and it is very useful. However, there are already PdfDocumentBuilder and PdfPageBuilder classes for building new documents.
Additionally, there is work in #250 to allow pages to be copied from a PdfDocument Page to a PdfPageBuilder, which would accomplish the same splitting / merging functionality.
I think the approach in #250 is more general purpose and the PdfMerger could simply be a helper that uses this functionality internally. This mirrors the approach of PdfSharp for splitting / merging (open a document, copy pages you are interested in to a new document). It also has significant benefits going forward for scenarios where conditional splitting / merging needs to occur. With the current PdfMerger you would have to:
- Open/parse the documents using PdfDocument/Pages
- Build a page list matching a condition (e.g. pages with word X)
- Feed the page list to PdfMerger, which then re-opens and parses the documents a second time and outputs the final PDF
If the approach from #250 is used, you would analyze the parsed Page from a PdfDocument and add it to the PdfDocumentBuilder if it matches the conditions you set (only loading and parsing once).
I know the current implementations of PdfDocumentBuilder and PdfPageBuilder do not fully meet the editing requirements of the PdfMerger, but if effort is being spent here I think bringing PdfDocumentBuilder and PdfPageBuilder up to parity and having everything in a single place would be beneficial.
What do you think @EliotJones @InusualZ @Poltuu ?
See the pseudo-example API below:
using var doc = PdfDocument.Open(file);
var builder = new PdfDocumentBuilder();
// currently existing: create a new blank page
var pageBuilder = builder.AddPage(PageSize.A4);
// new: create from an already-parsed page, same as the work in PR #250
// but on PdfDocumentBuilder rather than PdfPageBuilder
var page = doc.GetPage(1);
var pageBuilder2 = builder.AddPage(page);
// new: add page(s) from an opened PdfDocument -> optimized to not fully parse the page (e.g. leave content stream unparsed)
// similar functionality / performance to PdfMerger but allows modifications using PdfPageBuilder
var pageBuilder3 = builder.AddPage(doc, 1); // copy page 1 from source doc
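As an illustration of the conditional-splitting scenario, a usage sketch with the proposed AddPage(doc, pageNumber) overload might look like this (the word filter is just an example; the overload does not exist yet):

using System.IO;
using System.Linq;
using UglyToad.PdfPig;
using UglyToad.PdfPig.Writer;

// Copy only the pages containing the word "Invoice" into a new document,
// parsing the source a single time.
using var doc = PdfDocument.Open("source.pdf");
var builder = new PdfDocumentBuilder();

for (var i = 1; i <= doc.NumberOfPages; i++)
{
    var page = doc.GetPage(i);

    if (page.GetWords().Any(w => w.Text == "Invoice"))
    {
        builder.AddPage(doc, i); // proposed overload from this issue
    }
}

File.WriteAllBytes("filtered.pdf", builder.Build());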
I totally agree. Having the proper APIs on the PdfDocumentBuilder seems to me the best way in terms of both performance and functionality for complex PDF creation, since there are many scenarios where you need to read the pages anyway to know whether you want them in your created document. Plus, the pages need to be partially read to be added, as you mention.
I don't know how far along the code is for implementing this feature this way; I might give it a go if I find some time.
As for refactoring PdfMerger with this logic, I'll let the code owners decide on the matter.
Sorry for my lack of engagement with the project recently, I'm just taking a bit of a pause but I'm glad people are discussing/thinking about the future/features.
I think you're on the right track for how the editing/creating APIs should be unified and it makes sense to use the document builder for that. I think it also makes sense to have static convenience API wrappers (PdfMerger/PdfSplitter/etc.) for modifying that internal structure which just call into the single editing API.
My main challenge when I was looking at this is handling "PDF nonsense" where there is content in the original document that isn't supported by PdfPig currently, like JavaScript or whatever. We can take the approach of PdfMerger where we copy the tokens and rewrite the indirect references, but I just get intimidated by the task and the unknown-unknowns.
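For reference, the token-copy-and-rewrite idea in general terms (this is just a sketch of the concept, not PdfPig's actual code; CopyObject is a hypothetical helper that walks an object's tokens and calls back into RemapReference for any nested references):

// Sketch only: remap indirect references while copying objects from a source
// document into the output, so the copied object graph keeps pointing at the
// right things under the new numbering.
private readonly Dictionary<long, long> objectNumberMap = new ();
private long nextObjectNumber = 1;

private long RemapReference(long sourceObjectNumber)
{
    if (objectNumberMap.TryGetValue(sourceObjectNumber, out var destination))
    {
        return destination;
    }

    // Reserve the destination number before copying the object's contents so
    // cyclic references (e.g. Page -> Parent -> Kids -> Page) terminate.
    destination = nextObjectNumber++;
    objectNumberMap[sourceObjectNumber] = destination;

    // Copy the source object, rewriting any references found inside it via
    // this same method (CopyObject is hypothetical).
    CopyObject(sourceObjectNumber, destination);
    return destination;
}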
It has been the intent to unify editing into the builder for a long while (https://github.com/UglyToad/PdfPig/issues/27), but my focus/interest was mainly on reading and, given my lack of time, I've never made a big effort to actually do it. I think we can re-use a lot of the great work by @InusualZ for merging.
Do we want to just merge #250 and start building out from there? I can put some work in around lazy copying into PdfDocumentBuilder but would build on top of the work in #250.
I totally agree with all of the above. The thing that I have been struggling with the most is resource management. I have been trying to come up with this API since I started to implement the PdfMerger class, because as of right now the PdfMerger does what most libraries do: copy blindly without caring about resource bloat. I came up with #193 and that has been my best idea, but I don't know if it's flexible and performant enough. Any ideas are welcome.
Related to #250: it's functional but it's kind of a proof of concept, because it suffers from the same problems. I would like to improve it a lot more so it doesn't suffer from resource bloat.
I really like this library, but it's missing the editing side. If we manage to get it on par with the reading part, this library would kick some ***
Any questions, concerns, etc. you can ask here or on Gitter. I may not answer right away, but sooner or later I will answer :)
@InusualZ I experimented with some resource management stuff on top of your work and found it to work well. General approach was to hash the contents of indirect objects as they are being written, and then compare newly written objects to the existing hashes to see if they already exist. One caveat was this doesn't work for cyclic references, but most of the resources (XObjects) that you'd want to dedup wouldn't have cyclic references.
The basic logic for deduplication when adding an indirect object token (implemented in PdfStreamWriter for my experiments):
// Serialized bytes of each written object, keyed by its reference.
private readonly Dictionary<IndirectReference, byte[]> tokens = new ();

// Reverse lookup: serialized bytes -> the reference already assigned to identical content.
private readonly Dictionary<byte[], IndirectReferenceToken> hashes = new (new FNVByteComparison());

private MemoryStream ms = new MemoryStream();

private IndirectReferenceToken AddToken(IToken token)
{
    // Serialize the token into a reusable buffer.
    ms.SetLength(0);
    TokenWriter.WriteToken(token, ms);
    var contents = ms.ToArray();

    // If an identical object has already been written, reuse its reference.
    if (hashes.TryGetValue(contents, out var value))
    {
        return value;
    }

    // Otherwise assign the next object number and remember the content hash.
    var reference = new IndirectReference(CurrentNumber++, 0);
    var referenceToken = new IndirectReferenceToken(reference);
    tokens.Add(referenceToken.Data, contents);
    hashes.Add(contents, referenceToken);
    return referenceToken;
}
The hash dictionary is built with a custom IEqualityComparer:
internal class FNVByteComparison : IEqualityComparer<byte[]>
{
    public bool Equals(byte[] x, byte[] y)
    {
        if (x.Length != y.Length)
        {
            return false;
        }

        for (var i = 0; i < x.Length; i++)
        {
            if (x[i] != y[i])
            {
                return false;
            }
        }

        return true;
    }

    public int GetHashCode(byte[] obj)
    {
        var hash = FnvHash.Create();
        foreach (var t in obj)
        {
            hash.Combine(t);
        }

        return hash.HashCode;
    }
}
When it finally writes the PDF, it just uses the contents of the hash dictionary so we aren't serializing everything twice. I tested it against some large internal documents (20,000+ pages) and it ran extremely quickly, something like 5 seconds to read, combine, and optimize output.
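To make that last step concrete, here is a rough sketch of what the final write pass could look like under the same assumptions as the snippet above (the tokens field as declared earlier; WriteXrefAndTrailer is a hypothetical helper, not existing PdfPig code):

private void WriteTo(Stream output)
{
    var offsets = new Dictionary<IndirectReference, long>();

    foreach (var pair in tokens)
    {
        // Record where this object starts for the cross-reference table.
        offsets[pair.Key] = output.Position;

        var header = System.Text.Encoding.ASCII.GetBytes($"{pair.Key.ObjectNumber} {pair.Key.Generation} obj\r\n");
        output.Write(header, 0, header.Length);

        // The body was already serialized when the token was added, so just replay the bytes.
        output.Write(pair.Value, 0, pair.Value.Length);

        var footer = System.Text.Encoding.ASCII.GetBytes("\r\nendobj\r\n");
        output.Write(footer, 0, footer.Length);
    }

    // Hypothetical helper: writes the xref table and trailer from the recorded offsets.
    WriteXrefAndTrailer(output, offsets);
}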
Now regarding the PdfDocumentBuilder, I think it may be easiest to do only very basic optimization when adding pages / content and then, when Build() is called, pass the indirect objects off to an IPdfFormatter or something similar which is responsible for optimizing further / serializing the document structure. One implementation could be a DeduplicatingPdfFormatter that rebuilds indirect references using hashes as shown above to dedup them and then writes the output. The benefit here is that in scenarios where you don't need the hash approach (e.g. everything comes from the same source document or is built from scratch) you could use a default IPdfFormatter which just writes the raw content out and wouldn't pay the overhead of hashing / lookups.
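A minimal sketch of what that could look like, purely as an assumption for discussion (IPdfFormatter, DefaultPdfFormatter and DeduplicatingPdfFormatter are hypothetical names, not existing PdfPig types):

using System.Collections.Generic;
using UglyToad.PdfPig.Core;    // namespaces per current PdfPig layout, adjust as needed
using UglyToad.PdfPig.Tokens;

// Hypothetical: Build() hands the accumulated indirect objects to a formatter,
// which decides how much optimization to do before serializing the document.
public interface IPdfFormatter
{
    // Returns the bytes of the finished PDF.
    byte[] Format(IReadOnlyDictionary<IndirectReference, IToken> objects, DictionaryToken trailer);
}

A DefaultPdfFormatter would write the objects exactly as received, while a DeduplicatingPdfFormatter would re-hash contents (as in the snippet above) to collapse duplicate resources before writing.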
Let me know what you think about the general concept of doing very simple optimizations when adding resources to PdfDocumentBuilder but allowing the full optimizations to be completed when building / saving the content, keeping this logic outside of the Builder itself.
@plaisted
That sounds great!
About the IPdfFormatter: for the most part the idea sounds great, but I'm not entirely convinced. But hey, let's see some code, and if I see something that I don't like I'll let you know.
I've been using PdfPig for a bit now and it is truly a pleasure to work with. The features discussed in this issue are exactly what I'd need to complete my project. Is there anything I can do to accelerate the work done here, despite my rather basic C# skills? Would be awesome to get something like #250 merged and then a new release pushed out.
I'm planning to spend some time this weekend working on it, but I'm guessing it will be a while before anything is merged / released.
Created #279 to allow efficient import of pages into a PdfDocumentBuilder. It has some internal changes to the way PdfDocumentBuilder works and writes data, so feedback / review would be appreciated.
Closing this since I want to clear the backlog and I think it's mainly handled now?