Strange behavior of `getDocumentProxy`'s buffer when extracting text AND rendering page as image (only for some pdf)

Open ndrbrt opened this issue 1 year ago • 4 comments

Environment

Node v20.11.1, unpdf v0.11.0

Reproduction

I got the original error in a server route of a Nuxt 3 project. Also, in the original app I performed other operations besides text/metadata extraction and image rendering.

Anyway, I prepared a new Nitro project for this issue and isolated only the error involved. You can find the repo here: https://github.com/ndrbrt/unpdf-issue

Describe the bug

The issue appears only with some PDFs. The failing ones all contain images, but I don't know whether this is related to #4, or whether only PDFs with images are affected.

Error A

The original code was similar to that in server/api/error-a.ts.

If you run the dev server and open, e.g.:

  • http://localhost:3000/api/error-a?url=https://github.com/raphink/geneve_1564/releases/download/2015-07-08_01/geneve_1564.pdf

You get the following error:

[nitro] [request error] [unhandled] Cannot read properties of undefined (reading 'createCanvas')
  at i.constructor._createCanvas (./node_modules/.pnpm/[email protected]/node_modules/unpdf/dist/pdfjs.mjs:1:1552904)
  at i.constructor.create (./node_modules/.pnpm/[email protected]/node_modules/unpdf/dist/pdfjs.mjs:1:1399305)
  at CachedCanvases.getCanvas (./node_modules/.pnpm/[email protected]/node_modules/unpdf/dist/pdfjs.mjs:1:1474861)
  at CanvasGraphics.beginGroup (./node_modules/.pnpm/[email protected]/node_modules/unpdf/dist/pdfjs.mjs:1:1502437)
  at CanvasGraphics.executeOperatorList (./node_modules/.pnpm/[email protected]/node_modules/unpdf/dist/pdfjs.mjs:1:1482511)
  at InternalRenderTask._next (./node_modules/.pnpm/[email protected]/node_modules/unpdf/dist/pdfjs.mjs:1:1591245)
  at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

However, as I said, if you pass some other pdfs, everything's fine, e.g.:

  • http://localhost:3000/api/error-a?url=https://bitcoin.org/bitcoin.pdf

Working version

The only fix I found is the one in server/api/working.ts: I copy the original buffer before passing it to getDocumentProxy, then pass the copy to renderPageAsImage. With this change both requests succeed:

  • http://localhost:3000/api/working?url=https://bitcoin.org/bitcoin.pdf
  • http://localhost:3000/api/working?url=https://github.com/raphink/geneve_1564/releases/download/2015-07-08_01/geneve_1564.pdf
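
The copy-first workaround can be illustrated without unpdf. The sketch below assumes the underlying cause is that the first consumer takes ownership of the bytes it receives (PDF.js documents that typed arrays passed to getDocument are generally transferred to the worker thread). That cause is not confirmed in this thread. takeOwnership is a hypothetical stand-in, and structuredClone with a transfer list is used only to simulate such a hand-off.

```typescript
// Hypothetical stand-in for an API that takes ownership of its input,
// as a transfer to a worker would. An assumption for illustration, not unpdf code.
function takeOwnership(data: ArrayBuffer): number {
  const moved = structuredClone(data, { transfer: [data] });
  return moved.byteLength;
}

// Four placeholder bytes ("%PDF") standing in for a downloaded file.
const original = new ArrayBuffer(4);
new Uint8Array(original).set([0x25, 0x50, 0x44, 0x46]);

// One independent copy per consumer; ArrayBuffer.prototype.slice allocates new memory.
const forText = original.slice(0);
const forImage = original.slice(0);

const consumedBytes = takeOwnership(forText);

console.log(consumedBytes);       // bytes received by the first consumer
console.log(forText.byteLength);  // size of the copy that was handed over
console.log(forImage.byteLength); // size of the copy kept for the second consumer
```

Giving each consumer its own copy costs extra memory, but it means no call can invalidate the input of a later one.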

Error B

I also tried another approach in server/api/error-b.ts, passing a new Uint8Array(buffer) directly to renderPageAsImage. This way, if you open:

  • http://localhost:3000/api/error-b?url=https://github.com/raphink/geneve_1564/releases/download/2015-07-08_01/geneve_1564.pdf

You get this error:

[nitro] [request error] [unhandled] Unable to deserialize cloned data.
  at LoopbackPort.postMessage (./node_modules/.pnpm/[email protected]/node_modules/unpdf/dist/pdfjs.mjs:1:1573782)
  at MessageHandler.sendWithPromise (./node_modules/.pnpm/[email protected]/node_modules/unpdf/dist/pdfjs.mjs:1:1514035)
  at ./node_modules/.pnpm/[email protected]/node_modules/unpdf/dist/pdfjs.mjs:1:1561726
  at process.processTicksAndRejections (node:internal/process/task_queues:95:5)

Interestingly, in this case, if you repeat the request disabling text extraction (note the query param), it works:

  • http://localhost:3000/api/error-b?url=https://github.com/raphink/geneve_1564/releases/download/2015-07-08_01/geneve_1564.pdf&text=false
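
One possible reason the Error B approach behaves differently from the working version: if buffer is an ArrayBuffer (an assumption, since the code of error-b.ts is not quoted here), new Uint8Array(buffer) creates a view over the same memory rather than a copy. The sketch below shows the difference. The transfer via structuredClone only simulates a consumer taking ownership and is not taken from unpdf.

```typescript
const sourceBuffer = new ArrayBuffer(4);

// A view shares memory with sourceBuffer; slice(0) allocates a separate copy.
const sharedView = new Uint8Array(sourceBuffer);
const independentCopy = new Uint8Array(sourceBuffer.slice(0));

// Writing through the view changes sourceBuffer, but not the copy.
sharedView[0] = 0x25;
const firstByteOfSource = new Uint8Array(sourceBuffer)[0];
const firstByteOfCopy = independentCopy[0];

// Simulate a consumer that takes ownership of sourceBuffer (assumption).
structuredClone(sourceBuffer, { transfer: [sourceBuffer] });

console.log(firstByteOfSource, firstByteOfCopy);
console.log(sharedView.byteLength);      // the view follows its detached buffer
console.log(independentCopy.byteLength); // the copy keeps its own memory
```

If buffer is instead a typed array or a Node Buffer, new Uint8Array(buffer) does copy the bytes, so this explanation would not apply.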

Additional context

I did not use the official PDF.js build because I couldn't get it to work. Instead I used unpdf's default build, which worked fine until I ran into the problem described above.

Logs

No response

Aug 29 '24 11:08 ndrbrt

Hi there! Thanks for the thorough issue description. One question: How did you deploy the app? Canvas support is only possible in Node deploy targets.

Oct 02 '24 09:10 johannschopplich

Hi @johannschopplich, I deployed the app on Vercel using the default config as in https://nuxt.com/deploy/vercel. (It works the same way both on Vercel and locally)

Oct 02 '24 10:10 ndrbrt

I see. It's probably not gonna work on Vercel, since the canvas module requires Node.js bindings.

For your other examples: Please use the official PDF.js build, because the serverless build (used by unpdf by default) has canvas support stripped out. Can you please follow the renderPageAsImage guide to set up the pdfjs-dist build together with canvas?

import { configureUnPDF, renderPageAsImage } from "unpdf";

await configureUnPDF({
  // Use the official PDF.js build
  pdfjs: () => import("pdfjs-dist"),
});

const result = await renderPageAsImage(pdf, 1, {
  canvas: () => import("canvas"),
});

Oct 02 '24 10:10 johannschopplich

Actually I did try to use pdfjs-dist, but it resulted in an error.

yarn add pdfjs-dist

await configureUnPDF({
  // Use the official PDF.js build
  pdfjs: async () => await import('pdfjs-dist'),
})

ERROR  [nuxt] [request error] [unhandled] [500] Resolving failed. Please check the provided configuration.
  at resolvePDFJSImports (./node_modules/unpdf/dist/index.mjs:33:13)
  at async configureUnPDF (./node_modules/unpdf/dist/index.mjs:179:5)
  at Object.handler (./server/api/test.ts:5:1)
  at async ./node_modules/h3/dist/index.mjs:1975:19
  at async Object.callAsync (./node_modules/unctx/dist/index.mjs:72:16)
  at async Server.toNodeHandle (./node_modules/h3/dist/index.mjs:2266:7)
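
The "Resolving failed" message does not show why the import failed. One way to narrow this down is to attempt the same dynamic import outside configureUnPDF and look at the raw error. The tryImport helper below is hypothetical (not part of unpdf or Nitro) and only a minimal sketch of that idea.

```typescript
// Hypothetical diagnostic helper: attempts a dynamic import and reports
// the underlying error message instead of throwing.
async function tryImport(
  specifier: string,
): Promise<{ ok: boolean; error?: string }> {
  try {
    await import(specifier);
    return { ok: true };
  } catch (error) {
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, error: message };
  }
}

// Example: check whether the package used in the configuration resolves at all.
tryImport("pdfjs-dist").then((result) => console.log(result));
```

If the direct import fails too, the reported message should point at the actual cause, for example a missing package or an incompatible module format.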

Oct 10 '24 14:10 ndrbrt