pdfjs
Cannot catch error in asBuffer()
I'm currently merging around 1500 PDFs and trying to find the defective one, but I cannot catch errors produced by this.end() in asBuffer().
While errors get caught here:
try {
doc.pipe(fs.createWriteStream(fullPdfPath));
await doc.end();
} catch (err) { console.error(err); }
The node process quits with an unhandled error here:
try {
const buf = await doc.asBuffer();
fs.writeFileSync(fullPdfPath, buf, { encoding: 'binary' });
} catch (err) { console.error(err); }
I'm pretty sure the reason for the uncaught error is this line: https://github.com/rkusa/pdfjs/blob/3374d1ff1142d16e47a10dac2ba93a3f0f161a35/lib/document.js#L636
It should probably be:
if (shouldEnd) {
this.end().catch(reject)
}
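To illustrate why the rejection escapes the try/catch, here is a minimal sketch that does not use pdfjs at all. FakeDocument, asBufferUnforwarded, asBufferForwarded and caughtMessage are hypothetical names that only mimic the shape of the linked code: a promise executor that calls an async end().

```javascript
// Hypothetical stand-in for a document whose end() fails asynchronously.
class FakeDocument {
  async end() {
    throw new Error('Invalid xref object');
  }

  // Mirrors the reported pattern: the promise returned by end() is dropped,
  // so its rejection never reaches the promise that the caller awaits.
  // Calling this produces an unhandled rejection, so it is only defined here.
  asBufferUnforwarded() {
    return new Promise((resolve, reject) => {
      this.end();
    });
  }

  // Mirrors the proposed change: forward the rejection to the outer promise.
  asBufferForwarded() {
    return new Promise((resolve, reject) => {
      this.end().catch(reject);
    });
  }
}

// Returns the message of the error caught by an ordinary try/catch.
async function caughtMessage() {
  try {
    await new FakeDocument().asBufferForwarded();
    return null;
  } catch (err) {
    // Reachable only because the rejection was forwarded with .catch(reject).
    return err.message;
  }
}
```

With the forwarded variant, the caller's catch block receives the error. With the unforwarded variant, the awaited promise never settles and Node reports the rejection of end() as unhandled.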
Interesting side fact:
PDFs throwing errors like Invalid xref object at 54524 or Name must start with a leading slash, found: 0 are single-page PDFs previously extracted by pdfjs from other multi-page PDFs. Extracting worked, but merging again failed.
I could get rid of the Invalid xref object error by extracting with asBuffer(), writeFileSync and encoding binary instead of pipe and stream, but the one PDF with Name must start with a leading slash, found: 0 drives me crazy.
I'm currently merging around 1500 PDFs
This is something pdfjs can handle, besides the described issue with a defective one in between?
I'm pretty sure the reason for the uncaught error is this line:
https://github.com/rkusa/pdfjs/blob/3374d1ff1142d16e47a10dac2ba93a3f0f161a35/lib/document.js#L636
It should probably be:
if (shouldEnd) { this.end().catch(reject) }
I think you are right.
Interesting side fact: PDFs throwing errors like Invalid xref object at 54524 or Name must start with a leading slash, found: 0 are single-page PDFs previously extracted by pdfjs from other multi-page PDFs. Extracting worked, but merging again failed. I could get rid of the Invalid xref object error by extracting with asBuffer(), writeFileSync and encoding binary instead of pipe and stream, but the one PDF with Name must start with a leading slash, found: 0 drives me crazy.
Are you able to provide a small example to repo either or both errors?
This is something pdfjs can handle, besides the described issue with a defective one in between?
Yes, and it's blazingly fast ;-)
First I had to extract the 1500 single pages from around 25 different multi page PDFs and also needed them as JPG files. This took a little time, mostly because of the image extraction:
const jpgScale = 5;
for (const filename of fs.readdirSync(multiPagesPdfDirectory)) {
if (!filename.endsWith('.pdf')) continue;
const prefix = path.basename(filename, '.pdf');
const filepath = path.join(multiPagesPdfDirectory, filename);
const src = new pdfjs.ExternalDocument(fs.readFileSync(filepath));
for (let num = 1; num <= src.pageCount; num ++) {
const pdfFilepath = path.join(singlePagesPdfDirectory, `${prefix}-${String(num).padStart(4, '0')}.pdf`);
const jpgFilepath = pdfFilepath + '.jpg';
const doc = new pdfjs.Document();
doc.addPageOf(num, src);
//This created some invalid PDFs (Error: Invalid xref object at 54524 - only noticed when merging again in the next code block)
// doc.pipe(fs.createWriteStream(pdfFilepath));
// await doc.end();
//This created mostly valid PDFs (except: Name must start with a leading slash, found: 0 - only noticed when merging again in the next code block)
await doc.asBuffer().then(data => fs.writeFileSync(pdfFilepath, data, { encoding: 'binary' }));
const image = (await convert(pdfFilepath, { scale: jpgScale }))[0]; //pdf-img-convert
const jpg = sharp(image, { failOn: 'none' })
.flatten({ background: '#ffffff' })
.toColourspace('srgb')
.jpeg({ quality: 85, progressive: true });
await jpg.toFile(jpgFilepath);
}
}
Then I had to merge all of them into a single PDF file. This took under 1 second, only issue could have been memory (especially when automatically retrying and not destroying a failed writeStream):
const doc = new pdfjs.Document();
for (const filename of fs.readdirSync(singlePagesPdfDirectory)) {
if (!filename.endsWith('.pdf')) continue;
const filepath = path.join(singlePagesPdfDirectory, filename);
try {
const src = fs.readFileSync(filepath);
const ext = new pdfjs.ExternalDocument(src);
doc.addPagesOf(ext);
} catch(err) {
const { data, info } = await sharp(filepath + '.jpg', { failOn: 'none' }).toBuffer({ resolveWithObject: true });
const width = info.width / jpgScale;
const height = info.height / jpgScale;
const pdf = pdfmake.createPdfKitDocument({
pageSize: { width, height },
pageOrientation: 'portrait',
pageMargins: [0, 0, 0, 0],
content: [{
image: data,
left: 0,
top: 0,
width: width,
height: height,
}],
});
const buf = await new Promise((resolve, reject) => {
const chunks = [];
pdf.on('data', chunk => chunks.push(chunk));
pdf.on('end', () => resolve(Buffer.concat(chunks)));
pdf.on('error', reject);
pdf.end();
});
const ext = new pdfjs.ExternalDocument(buf);
doc.addPagesOf(ext);
}
}
doc.pipe(fs.createWriteStream(fullPdfPath));
await doc.end();
As I was in a rush, I added a quick fix falling back to the extracted image. But this only helped with "Invalid xref object at 54524" errors as they occurred while reading the single page PDFs. The "Name must start with a leading slash, found: 0" error occurred while writing the fully merged PDF, this is where I could not catch the error to find out which page.
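One way to find out which page is to blame is to test-merge every single-page PDF on its own and collect the failures. The sketch below is only an illustration: findFailingInputs and tryMerge are hypothetical names, and it assumes a pdfjs version in which asBuffer() rejects properly instead of raising an unhandled rejection (the thread later reports this for 2.5.2).

```javascript
// Runs `tryMerge` on every input separately and collects the failures.
// `tryMerge` is a caller-supplied async function (hypothetical name) that
// should build a throwaway document from one input and fully write it.
async function findFailingInputs(inputs, tryMerge) {
  const failures = [];
  for (const input of inputs) {
    try {
      await tryMerge(input);
    } catch (err) {
      failures.push({ input, message: err.message });
    }
  }
  return failures;
}

// Possible use with pdfjs, using only calls that appear in this thread:
//
//   const failures = await findFailingInputs(filepaths, async (filepath) => {
//     const doc = new pdfjs.Document();
//     doc.addPagesOf(new pdfjs.ExternalDocument(fs.readFileSync(filepath)));
//     await doc.asBuffer(); // write errors surface here, one file at a time
//   });
//   console.log(failures);
```

Because each input gets its own document, an error thrown while writing can be attributed to exactly one file, at the cost of writing every page twice.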
Also trying to repair the affected PDFs (after narrowing down which single page was actually to blame) did not help.
Lots of unnecessary code but I thought you might be interested in how I used your library.
Are you able to provide a small example to repo either or both errors?
Unfortunately the repo is private, but I'll send you example PDFs via email, that you can use along my code above, as soon as I have time to find the relevant files.
I'm having the exact same error with a few different PDF files which I can send you privately. I'm using latest pdfjs 2.5.0
My real code downloads two file buffers from external pdf, combines them, and then throws the unhandled error while converting the combined document to a buffer.
Here's my minimal repro code:
import * as pdfjs from 'pdfjs';
import * as https from 'https';
(async () => {
try {
console.log('downloading...');
const pdfBuffer = await downloadExternalReport('contact me for URLs');
const doc = new pdfjs.ExternalDocument(pdfBuffer);
const outputDoc = new pdfjs.Document();
outputDoc.addPagesOf(doc);
console.log('converting to buffer...');
const outBuffer = await outputDoc.asBuffer(); // <- error thrown here but not caught
console.log('done!');
} catch (e) {
console.error('error combining PDFs', e);
}
})();
function downloadExternalReport(url: string) {
const data: Buffer[] = [];
return new Promise<Buffer>((resolve, reject) => {
const request = https.get(url, (response) => {
if (response.statusCode !== 200) {
reject(new Error('Error downloading external report'));
} else {
response.on('data', (d: Buffer) => data.push(d));
response.on('end', () => resolve(Buffer.concat(data)));
}
});
request.on('error', reject)
})
}
One file gives me the error "Invalid value":
2023-07-20 11:00:29 info: downloading...
2023-07-20 11:00:30 info: converting to buffer...
2023-07-20 11:00:30 error: (node:5764) UnhandledPromiseRejectionWarning: Error: Invalid value
at Lexer._error (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\lexer.js:152:11)
at Object.exports.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\value.js:26:9)
at Function.parseInner (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\object.js:80:28)
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\object.js:68:27)
at parseObject (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\reference.js:128:22)
at PDFReference.get [as object] (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\reference.js:15:17)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:68:35)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:84:18)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:75:16)
at ExternalDocument.write (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\external.js:63:14)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at Document.end (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\document.js:544:5)
(Use `node --trace-warnings ...` to show where the warning was created)
2023-07-20 11:00:30 error: (node:5764) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
2023-07-20 11:00:30 error: (node:5764) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Another gives me the error "Name must start with a leading slash, found: 0":
2023-07-20 11:03:58 info: converting to buffer...
2023-07-20 11:03:58 error: (node:34740) UnhandledPromiseRejectionWarning: Error: Name must start with a leading slash, found: 0
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\name.js:67:13)
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\dictionary.js:71:27)
at Object.exports.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\value.js:20:30)
at Function.parseInner (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\object.js:80:28)
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\object.js:68:27)
at parseObject (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\reference.js:128:22)
at PDFReference.get [as object] (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\reference.js:15:17)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:68:35)
at C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:89:18
at Array.forEach (<anonymous>)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:88:15)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:76:16)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:72:16)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:84:18)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:75:16)
at ExternalDocument.write (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\external.js:63:14)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at Document.end (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\document.js:544:5)
2023-07-20 11:03:58 error: (node:34740) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
2023-07-20 11:03:58 error: (node:34740) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
And another gives me the error "Name must start with a leading slash, found: (":
2023-07-20 11:06:21 info: converting to buffer...
2023-07-20 11:06:21 error: (node:16452) UnhandledPromiseRejectionWarning: Error: Name must start with a leading slash, found: (
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\name.js:67:13)
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\dictionary.js:71:27)
at Object.exports.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\value.js:20:30)
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\dictionary.js:74:30)
at Object.exports.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\value.js:20:30)
at Function.parseInner (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\object.js:80:28)
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\object.js:68:27)
at parseObject (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\reference.js:128:22)
at PDFReference.get [as object] (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\reference.js:15:17)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:68:35)
at C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:89:18
at Array.forEach (<anonymous>)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:88:15)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:84:18)
at Function.addObjectsRecursive (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\parser\parser.js:75:16)
at ExternalDocument.write (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\external.js:63:14)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
at Document.end (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\document.js:544:5)
2023-07-20 11:06:21 error: (node:16452) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
2023-07-20 11:06:21 error: (node:16452) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
Interesting side fact: PDFs throwing errors like Invalid xref object at 54524 or Name must start with a leading slash, found: 0 are single-page PDFs previously extracted by pdfjs from other multi-page PDFs. Extracting worked, but merging again failed.
In my case, all three of my files were previously generated by pdfjs as well. My first process is to generate Report A from HTML using wkhtmltopdf, then I append a few 3rd party PDFs (invoices, etc.) to that report using pdfjs and save that, then upload to S3 - all good. My next step is to generate Report B and upload that to S3. Then, later on I download both files from S3 and append Report B to the end of Report A so I can upload the combined reports to a 3rd party API.
This error gets triggered when exporting the combined report to a buffer, even though both files were previously exported to buffer by pdfjs. So maybe something is caused by a file being exported twice. However, I guess that is a separate issue to this one of the error not being handled correctly - maybe #217 #166
The unhandled promise rejection error should be fixed on main. I've also added a test for adding a PDF generated by pdfjs, and that worked fine. So it generally seems to work (as in pdfjs generates PDFs that it itself deems valid), except when it does not. I don't know exactly yet what causes it to not work in both of your cases. Kinda doesn't make sense that adding a PDF works the first time, but doesn't when adding that generated PDF again later on.
I am afraid though that this issue isn't very high on my list, since it does not affect my own use-case. So that you can plan, you should know that I don't expect to work on that in the foreseeable future.
Thanks @rkusa. Do you have an ETA on when you can publish the unhandled promise rejection fix to npm?
@wildhart just released as 2.5.1
I've tried 2.5.1 and I'm afraid I still get the same unhandled errors as before.
If I edit your code directly in my node_modules folder, and make the change to document.js as suggested by @7freaks-otte (combined with your new return statement):
if (shouldEnd) {
return this.end().catch(reject)
}
Then the error is properly caught and handled by my own error handler:
...
console.log('converting to buffer...');
const outBuffer = await outputDoc.asBuffer();
console.log('done!');
} catch (e) {
console.error('error combining PDFs', e);
}
2023-07-24 11:15:17 info: converting to buffer...
2023-07-24 11:15:17 error: error combining PDFs Error: Name must start with a leading slash, found: 0
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\name.js:70:13)
at Function.parse (C:\Users\cmori\projects\Sourcecode\ASCP Web App\backend\node_modules\pdfjs\lib\object\dictionary.js:72:27)
Well, ... 🤦♂️ – apparently returning a promise to chain it was only a thing inside of a then()/catch(). I gave it some more time and added a unit test to be sure this time. Thanks for the feedback. Released as 2.5.2.
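The difference between returning a promise inside then() and inside a Promise executor can be shown with a small standalone sketch. All names here (returnedInThen, returnedInExecutor, settlesWithin) are made up for illustration. It uses resolved values so that nothing is left unhandled; a rejected promise returned from an executor is ignored in the same way and has to be forwarded explicitly.

```javascript
// Inside then(), a returned promise is adopted by the chain:
// the outer promise settles with the inner promise's result.
function returnedInThen() {
  return Promise.resolve().then(() => {
    return Promise.resolve('done');
  });
}

// Inside a Promise executor, the return value is ignored.
// Neither resolve nor reject is called, so the outer promise stays pending.
function returnedInExecutor() {
  return new Promise((resolve, reject) => {
    return Promise.resolve('done');
  });
}

// Helper: resolves to true if `promise` fulfills within `ms` milliseconds,
// otherwise to false.
function settlesWithin(promise, ms) {
  const timeout = new Promise((resolve) => setTimeout(() => resolve(false), ms));
  return Promise.race([promise.then(() => true), timeout]);
}

async function compare() {
  return {
    then: await settlesWithin(returnedInThen(), 20),
    executor: await settlesWithin(returnedInExecutor(), 20),
  };
}
```

Here compare() reports that only the then() variant settles, which matches the observation that adding return in front of this.end() was not enough on its own.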
@rkusa thanks for the toBuffer error fix.
As for the Name must start with a leading slash error, I understand your priorities.
I identified the single page PDF causing this error and I'll send it to you via mail. You can just use it with your newly written test as addallpages.pdf to reproduce the error. Maybe you can easily identify the problem once you find a little time.
@rkusa if it helps, do you accept Github sponsorship?
My client uses this in a commercial environment and this bug is costing them time (when we come across this error the only solution is to "reprint" the offending PDF, then we are able to append it to our own PDFs). I was going to see if I could investigate it myself if I could find any time.
So if you could fix this, you'd save us time and therefore $$, so would be happy to send something your way...
@wildhart If it helps, I can send you our failing single PDF (60-100KB) as well.
I was able to repair the PDF via iLovePDF and was then able to use addPagesOf without error.
I just found out, that they have a NodeJS library (https://github.com/ilovepdf/ilovepdf-nodejs) to access their API. Feels like a really ugly workaround but I was thinking of implementing this on failing PDFs.
@wildhart I appreciate the offer, but pdfjs is too low on my list of priorities to accept $ with a good conscience.
Anyway, the documents @7freaks-otte sent over made it very easy for me to spot the issue. Thanks a lot for narrowing it down to a single page @7freaks-otte!
I've just pushed a fix. However, adding pages of pdfjs generated PDFs that are already broken isn't fixed. Just newly generated PDFs with pdfjs should work now when being added again.
Mind checking main and confirming that it is fixed before I publish a new version?
I've tried installing your latest pdfjs direct from github, but I continue to get "Name must start with a leading slash, found: (" with some files.
Also, with another file I get your new error "Tried to write reference with null object id". What does this mean, and how can we avoid it?
I've sent you two files by email...
@rkusa Thank you very much, I'll try to test your fix the next days and give you feedback.
@wildhart Thanks for testing. To be sure, my fix prevents pdfjs from generating invalid PDFs (at least one instance). If you already have an invalid PDF, and try to add it to a new document, you'll still see the error. So maybe you are trying to add PDFs previously generated with pdfjs that are already broken?
The error Tried to write reference with null object id is a new addition to prevent generating such invalid PDFs in the first place. You might have encountered another instance where pdfjs would generate a broken PDF. Thanks for sending it over, I'll look at it.
So maybe you are trying to add PDFs previously generated with pdfjs that are already broken?
In the example I sent you "Tax-Invoice-M590936.pdf" that file was not generated by pdfjs (at least not by me) - that file was uploaded by one of our clients and triggers the "Name must start with a leading slash, found: (" error when appended to a pdf using pdfjs, then that pdf is appended to another pdf.
@rkusa sorry for the delay, I was quite busy the last weeks.
Your commit https://github.com/rkusa/pdfjs/commit/b6cdd70c64611d0e1369ad928028b2cf51009379 seems to fix the Name must start with a leading slash, found: 0 error, but same as @wildhart I now encounter the new TypeError: Tried to write reference with 'null' object id on the same page.
Maybe it's worth noting that I just want to add a single page (3) from the previously generated PDF.
What worked for me (though not practical) is:
- First I join several PDFs to a single one with your new unreleased version => this fixes the leading slash error.
- Second I extract single pages from the previously generated PDF with the currently released 2.5.2 version => this does not know the null object id error.
Just FYI, I've moved away from using pdfjs for merging PDFs, due to this issue with certain PDFs causing errors, and also excessive file sizes (#314).
Instead I'm using pdf-lib which is really easy to use to copy pages from one PDF to another, and it doesn't have any problems with the files we've provided here which throw errors in pdfjs, and the output file size is never bigger than the original files. It also seems a bit faster.
I'm still using pdfjs to generate PDF from html, but then I use pdf-lib to combine that with other PDF files.
In the example I sent you "Tax-Invoice-M590936.pdf" that file was not generated by pdfjs (at least not by me) - that file was uploaded by one of our clients and triggers the "Name must start with a leading slash, found: (" error when appended to a pdf using pdfjs, then that pdf is appended to another pdf.
File works for me with the previous fix – not sure if it is a specific constellation on how it is added to the file.
Your commit b6cdd70 seems to fix the Name must start with a leading slash, found: 0 error but same as @wildhart I now encounter the new TypeError: Tried to write reference with 'null' object id on the same page.
This error was added as part of the fix to prevent pdfjs from generating invalid PDFs in similar situations – and you seem to have found another one. However, I don't think that I'll find the time to look into that – sorry.
Just FYI, I've moved away from using pdfjs for merging PDFs, due to this issue with certain PDFs causing errors, and also excessive file sizes (#314).
Instead I'm using pdf-lib which is really easy to use to copy pages from one PDF to another, and it doesn't have any problems with the files we've provided here which throw errors in pdfjs, and the output file size is never bigger than the original files. It also seems a bit faster.
Sounds like a good decision to me. I've also added a note about the current maintenance status to the README. I myself moved most of my uses of pdfjs to a simple HTML to PDF via headless Chrome (I don't have the use-case of adding other PDFs anymore).
For the moment I'm OK with my workaround above using 2 pdfjs versions at a time, as the PDFs are only generated once every several months. I understand your priorities. Thanks for your help anyway @rkusa