Can't append certain PDFs

Open marcodafonseca opened this issue 4 years ago • 8 comments

A client of mine has found some more issues in the PdfJs library.

This is kind of a continuation of my previously logged issue (#209) in a broad sense, but it is failing at a different point.

We are experiencing two different errors here.

First problem: When scanning a file from a printer's feeder straight into PDF we get the following error:

Error: Invalid xref: xref expected but not found
    at Function.parseXrefObject (...\node_modules\pdfjs\lib\object\xref.js:110:13)
    at Function.parse (...\node_modules\pdfjs\lib\object\xref.js:67:19)
    at Parser.parse (...\node_modules\pdfjs\lib\parser\parser.js:45:29)
    at new ExternalDocument (...\node_modules\pdfjs\lib\external.js:17:12)
    at Object.exports.assembleAppendToForm (...\assembler.js:171:36)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async Object.exports.handler (...\app.js:46:9)

Here is an example file that is having this problem: Scan_20200512 (2).pdf
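
For anyone trying to narrow this down, here is a minimal diagnostic sketch, assuming the error means that the byte offset given after `startxref` does not land on an `xref` keyword. It uses only Node's `Buffer`; `checkStartXref` is a hypothetical helper name, not part of pdfjs, and it does not validate the table itself.

```javascript
// Diagnostic sketch (hypothetical helper, not a pdfjs API): check whether the
// offset that follows the last `startxref` keyword points at an `xref` table.
function checkStartXref(buf) {
  // latin1 maps every byte to exactly one character, so string indices
  // are the same as byte offsets in the file.
  const text = buf.toString('latin1')
  const pos = text.lastIndexOf('startxref')
  if (pos === -1) return { ok: false, reason: 'no startxref keyword' }

  const match = /^startxref\s+(\d+)/.exec(text.slice(pos))
  if (!match) return { ok: false, reason: 'no offset after startxref' }

  const offset = Number(match[1])
  if (text.slice(offset, offset + 4) === 'xref') {
    return { ok: true, offset, kind: 'table' }
  }
  // PDF 1.5+ files may store the cross-reference data in a stream object,
  // which starts with an object header ("N G obj") instead of "xref".
  if (/^\d+\s+\d+\s+obj/.test(text.slice(offset, offset + 32))) {
    return { ok: true, offset, kind: 'object (possibly an xref stream)' }
  }
  return { ok: false, offset, reason: 'expected "xref" at byte ' + offset }
}
```

It could be run against a file with `checkStartXref(require('fs').readFileSync(path))`, where `path` is whichever PDF is failing.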

Second problem: I'm not sure yet how to reproduce this problem (I will update the ticket as soon as I figure this out), but I did some digging and it looks like the issue lies in external.js at line 20, where the pages object is set.

const pages    = catalog.get('Pages').object.properties

Error message at this line:

TypeError: Cannot read property 'object' of undefined
    at new ExternalDocument (...\node_modules\pdfjs\lib\external.js:20:42)
    at Object.exports.assembleAppendToForm (...\assembler.js:171:36)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async Object.exports.handler (...\app.js:46:9)

While investigating I noticed that, in this particular instance, the Pages object is actually residing in

parser.trailer.get("Info").object.properties.get("Pages").object.properties

I wanted to issue a pull request with this particular fix, but while testing it I noticed that this new "Pages" object doesn't have all the properties within it that the catalog has. Namely the "Kids" property which seems to be used quite a bit, but I don't see another way to substitute this value. So this change seemed to have quite a knock-on effect. What is the purpose of Kids? Is it purely just to get a page count at the end of the day?
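
On the Kids question: in the PDF format, the `Kids` array of a `Pages` node lists its children, which are either `Page` leaves or nested `Pages` nodes, while `Count` holds the number of leaf pages below that node. So `Kids` is how the pages themselves are reached, not only how they are counted. The sketch below models that tree with plain objects to show the traversal; the object shapes and the `collectPages` name are illustrative assumptions and do not reflect pdfjs's internal types.

```javascript
// Illustrative model of a PDF page tree using plain objects (not pdfjs types).
// An intermediate node has Type 'Pages' and a Kids array; a leaf has Type 'Page'.
function collectPages(node, out = []) {
  if (node.Type === 'Page') {
    out.push(node) // leaf: an actual page
  } else if (node.Type === 'Pages') {
    // intermediate node: descend into each child in document order
    for (const kid of node.Kids || []) collectPages(kid, out)
  }
  return out
}

// A small tree: one direct page plus a nested Pages node holding two more.
const tree = {
  Type: 'Pages',
  Count: 3,
  Kids: [
    { Type: 'Page', name: 'p1' },
    {
      Type: 'Pages',
      Count: 2,
      Kids: [{ Type: 'Page', name: 'p2' }, { Type: 'Page', name: 'p3' }],
    },
  ],
}

const pages = collectPages(tree)
```

A node without `Kids` therefore gives an appender nothing to walk, which would be consistent with the knock-on effect described above.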

As things stand, the only file I have that shows this second issue is a production file containing sensitive client information, so I'm unfortunately not at liberty to share it here. However, as soon as I figure out how to reproduce this error I will attach a test-safe document to this issue, if the problem hasn't been fixed by then.

Please let me know if you need anything more from me as this is quite a big problem for us in production at the moment and I'm willing to aid in resolving this issue in any way I can.

Edit: Added copies of error messages to the second problem

marcodafonseca avatar May 12 '20 11:05 marcodafonseca

Hi

My client has given me permission to share with you the document that we are having problems with in the aforementioned second scenario. Here it is: CCF12052020.pdf

marcodafonseca avatar May 14 '20 09:05 marcodafonseca

> Hi
>
> My client has given me permission to share with you the document that we are having problems with in the aforementioned second scenario. Here it is CCF12052020.pdf

Should be fixed with https://github.com/rkusa/pdfjs/issues/212

I am still looking into the issue of the other file.

rkusa avatar May 14 '20 15:05 rkusa

Should be fixed, please try out version 2.3.7. Thanks for the report as well as for the initial investigation 👍 Please let me know if it is fixed

rkusa avatar May 14 '20 15:05 rkusa

> Should be fixed, please try out version 2.3.7. Thanks for the report as well as for the initial investigation 👍 Please let me know if it is fixed

Hi. Thanks for the work! I'm doing some quick preliminary tests and things are looking good so far.

However, I just noticed I sent you the wrong file for describing the 2nd problem. Sorry about that. Here is the correct file: Nuthurst_Stream_Cross_Parapet_Design_For_Approval_V01.pdf

marcodafonseca avatar May 15 '20 16:05 marcodafonseca

@marcodafonseca Do you know which program creates those PDFs? Because it apparently has quite a severe syntax error:

% java -jar test/preflight-app-2.0.19.jar Nuthurst_Stream_Cross_Parapet_Design_For_Approval_V01.pdf
The file Nuthurst_Stream_Cross_Parapet_Design_For_Approval_V01.pdf is not a valid PDF/A-1b file, error(s) :
1.1 : Header Syntax error, Second line must begin with '%' followed by at least 4 bytes greater than 127
1.0 : Syntax error, XREF for 11:0 points to wrong object: 10:0

While most PDF readers seem to notice and fix the xref pointing to the wrong object, I never planned to make pdfjs that permissive towards PDF syntax errors. So I am afraid I won't be investing time in this case myself, sorry. I might consider accepting a PR, though, if the fix isn't too invasive given that this is a PDF syntax error.
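
For reference, the preflight message "XREF for 11:0 points to wrong object: 10:0" says that the byte offset stored for object 11 is where object 10 starts. A minimal sketch of how such a mismatch could be detected, using only Node's `Buffer`; `xrefEntryMatches` is a hypothetical helper and not a pdfjs function, and it performs no repair.

```javascript
// Sketch (hypothetical helper, not a pdfjs API): confirm that a cross-reference
// offset lands on the expected "N G obj" object header.
function xrefEntryMatches(buf, offset, objNum, gen = 0) {
  // Read a short window at the offset; latin1 keeps one character per byte.
  const head = buf.toString('latin1', offset, offset + 32)
  const m = /^(\d+)\s+(\d+)\s+obj/.exec(head)
  if (!m) return { ok: false, found: null } // no object header at this offset

  const found = { num: Number(m[1]), gen: Number(m[2]) }
  return { ok: found.num === objNum && found.gen === gen, found }
}

// Tiny in-memory example: object 10 starts at byte 9, right after the header line.
const sample = Buffer.from('%PDF-1.4\n10 0 obj\n<<>>\nendobj\n', 'latin1')
const mismatch = xrefEntryMatches(sample, 9, 11) // entry claims object 11
const match = xrefEntryMatches(sample, 9, 10)    // entry claims object 10
```

A tolerant reader could use a check like this to decide when to fall back to scanning the file for object headers, which is presumably what the more permissive PDF viewers do.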

rkusa avatar May 22 '20 12:05 rkusa

> Do you know which program creates those PDFs?

I'm not sure, hey. To my knowledge the file was potentially scanned using a flatbed scanner of some kind.

Thanks for looking into this, though.

marcodafonseca avatar May 22 '20 12:05 marcodafonseca

I'm going to try my hand at putting together a fix that isn't too invasive

marcodafonseca avatar May 22 '20 12:05 marcodafonseca

Please excuse my deleted comment. I had a mistake in how I was converting Buffers. This seems to be working as expected. Thanks!

DaddyWarbucks avatar Jul 05 '22 20:07 DaddyWarbucks