File size problem: 433 MB generated from a 15 MB document

Open juanludlf opened this issue 2 years ago • 9 comments

What were you trying to do?

I'm trying to generate a new PDF document based on an existing one. See the "How can we reproduce the issue?" section to download the original PDF document that is causing this issue.

How did you attempt to do it?

I'm using code similar to this:

import fs from "fs";
import { PDFDocument, PageSizes, ParseSpeeds } from "pdf-lib";

const pdfBytes = fs.readFileSync("original.pdf");

// Load a PDFDocument from the existing PDF bytes
const inputPdf = await PDFDocument.load(pdfBytes, {
  ignoreEncryption: true,
  parseSpeed: ParseSpeeds.Fastest,
  capNumbers: true
});

// create a new PDFDocument
this.output = await PDFDocument.create();

// get document pages (getPages is synchronous, so no await is needed)
const pages = inputPdf.getPages();

for (let pageIndex = 0; pageIndex < pages.length; pageIndex++) {
  const page = pages[pageIndex];
  
  // add new page
  const newPage = this.output.addPage(PageSizes.A4);
        
  // embed and scale original page
  const embedPage = await this.output.embedPage(page);
  const scaledPageDims = embedPage.scale(0.75);
        
  newPage.drawPage(embedPage, {
    ...scaledPageDims,
    x: 10,
    y: 10
  });
}

// Serialize the PDFDocument to bytes (a Uint8Array)
const newPdfBytes = await this.output.save();

What actually happened?

The original document is 15 MB in size and the generated document is 433 MB.

What did you expect to happen?

I expected to get similar sizes from both the original and the generated document.

How can we reproduce the issue?

The code in the "How did you attempt to do it?" section will reproduce this issue.

I think this is an issue specifically with this document, which is based on scanned images.

Version

1.17.1

What environment are you running pdf-lib in?

Node

Checklist

  • [X] My report includes a Short, Self Contained, Correct (Compilable) Example.
  • [X] I have attached all PDFs, images, and other files needed to run my SSCCE.

Additional Notes

No response

juanludlf avatar Nov 02 '22 17:11 juanludlf

Hi @Hopding, can you help me figure out what's wrong here? Thank you.

juanludlf avatar Nov 30 '22 11:11 juanludlf

Also wondering about this

mrdavidrees avatar Dec 05 '22 12:12 mrdavidrees

Yes, same issue here. Even simply copying the pages increases the size of the resulting PDF:

const copyDocument = async (buffer) => {
  console.log('initial size: ', buffer.byteLength); // 20296
  const newPdf = await PDFDocument.create();
  const initialPdf = await PDFDocument.load(buffer);
  const pages = initialPdf.getPages();
  for (let i = 0; i < pages.length; i++) {
    const [newPage] = await newPdf.copyPages(initialPdf, [i]);
    newPdf.addPage(newPage);
  }
  const bufferCopy = await newPdf.save();
  console.log('copy size: ', bufferCopy.byteLength); // 31691
};
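
One thing that might be worth trying, offered as an assumption rather than a confirmed diagnosis: as far as I understand, each `copyPages` call copies objects independently, so resources shared between pages (fonts, images) may be copied again on every call. A minimal sketch that passes all page indices in a single call instead; `copyAllPages` is a hypothetical helper name, and it expects documents that are already loaded or created:

```javascript
// Sketch: copy every page of srcDoc into destDoc with ONE copyPages call.
// Assumption (not confirmed in this thread): objects shared between pages
// are copied once per copyPages call, so batching avoids duplicating them.
async function copyAllPages(srcDoc, destDoc) {
  // getPageIndices() returns [0, 1, ..., pageCount - 1]
  const copiedPages = await destDoc.copyPages(srcDoc, srcDoc.getPageIndices());
  copiedPages.forEach((page) => destDoc.addPage(page));
  return destDoc;
}
```

With pdf-lib this could be used as `copyAllPages(await PDFDocument.load(buffer), await PDFDocument.create())`, followed by `save()` on the result to compare byte lengths with the per-page version.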

SergeiReutov avatar Dec 09 '22 12:12 SergeiReutov

Yes, I ran into the same issue: splitting a 9 MB file into sub-files of 10 pages each produces about 60 MiB per sub-file.

// split.pdf.js
const fs = require('fs');
const path = require('path');
const { PDFDocument } = require('pdf-lib');

const splitPDF = async (pdfFilePath, outputDirectory) => {
  const data = await fs.promises.readFile(pdfFilePath);
  const readPdf = await PDFDocument.load(data);
  const { length } = readPdf.getPages();

  for (let i = 0, n = length; i < n; i += 10) {
    const writePdf = await PDFDocument.create();
    // Clamp to the page count so the last chunk does not read past the end
    for (let j = i; j < Math.min(i + 10, n); j += 1) {
      const [page] = await writePdf.copyPages(readPdf, [j]);
      writePdf.addPage(page);   
    }
    const bytes = await writePdf.save();
    const outputPath = path.join(outputDirectory, `I100_${i + 1}.pdf`);
    await fs.promises.writeFile(outputPath, bytes);
     
    console.log(`Added ${outputPath}`);
  }
};

// Attach catch to the promise chain, not to the return value of console.log
splitPDF('100.pdf', 'invoices')
  .then(() => console.log('Files have been split!'))
  .catch(console.error);
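
A possible variation on the split above, as a sketch only: compute the page indices for each chunk up front and make one `copyPages` call per output file rather than one per page. Whether this reduces the output size depends on the assumption that shared objects are copied once per `copyPages` call. `chunkPageIndices`, `splitIntoChunks`, and the `createDoc` factory are hypothetical names introduced for illustration.

```javascript
// Returns arrays of consecutive page indices, `size` per chunk, with the
// last chunk clamped to `total` so it never points past the final page.
function chunkPageIndices(total, size) {
  const chunks = [];
  for (let start = 0; start < total; start += size) {
    const end = Math.min(start + size, total);
    chunks.push(Array.from({ length: end - start }, (_, k) => start + k));
  }
  return chunks;
}

// Builds one output document per chunk, using a single copyPages call each.
// `createDoc` is a hypothetical factory, e.g. () => PDFDocument.create().
async function splitIntoChunks(srcDoc, createDoc, size) {
  const outputs = [];
  for (const indices of chunkPageIndices(srcDoc.getPageCount(), size)) {
    const out = await createDoc();
    const pages = await out.copyPages(srcDoc, indices);
    pages.forEach((page) => out.addPage(page));
    outputs.push(out);
  }
  return outputs;
}
```

Each returned document can then be saved and written to disk as in the loop above.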

ns-sjli avatar Dec 10 '22 00:12 ns-sjli

Have you tried using copyPages instead of embedPage?

// append to created pdf
const [copyPage] = await this.output.copyPages(inputPdf, [0]);
this.output.addPage(copyPage);

p-kuen avatar Dec 19 '22 05:12 p-kuen

Hi @p-kuen, I will give it a try. However, the code samples provided by @SergeiReutov and @ns-sjli use the copyPages method and have the same problem 🤔

juanludlf avatar Dec 19 '22 20:12 juanludlf

Oh sorry, I should have read more closely. I use copyPages myself, and as a workaround I run the whole merged PDF through Ghostscript for compression, so I never had problems with this. Not the cleanest solution, but effective.
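
For anyone who wants to try the Ghostscript route from Node, here is a rough sketch. The exact command is an assumption, since it was not spelled out above: it uses Ghostscript's `pdfwrite` device with a `-dPDFSETTINGS` preset and assumes a `gs` binary is on the PATH. `buildGsArgs` and `compressPdf` are hypothetical helper names.

```javascript
const { execFile } = require('child_process');

// Builds Ghostscript arguments that re-write a PDF through the pdfwrite device.
// `quality` maps to -dPDFSETTINGS (e.g. screen, ebook, printer, prepress).
function buildGsArgs(inputPath, outputPath, quality = 'ebook') {
  return [
    '-sDEVICE=pdfwrite',
    `-dPDFSETTINGS=/${quality}`,
    '-dNOPAUSE',
    '-dBATCH',
    '-dQUIET',
    `-sOutputFile=${outputPath}`,
    inputPath,
  ];
}

// Runs `gs` (assumed to be installed and on PATH) and resolves with the
// output path once the process exits successfully.
function compressPdf(inputPath, outputPath, quality) {
  return new Promise((resolve, reject) => {
    execFile('gs', buildGsArgs(inputPath, outputPath, quality), (err) =>
      err ? reject(err) : resolve(outputPath)
    );
  });
}
```

Usage would be something like `await compressPdf('merged.pdf', 'merged.small.pdf')`. Note that presets such as `ebook` downsample images, so this trades quality for size rather than only removing duplicated objects.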

p-kuen avatar Dec 19 '22 23:12 p-kuen

Has anybody found a solution to this issue?

vpatil007 avatar Feb 22 '23 22:02 vpatil007

Same issue here. Has anybody found a solution?

weihuiling071 avatar Feb 28 '24 06:02 weihuiling071