pdfjs
Can't append certain PDFs
A client of mine has found some more issues in the PdfJs library.
This is kind of a continuation of my previously logged issue (#209) in a broad sense, but it is failing at a different point.
We are experiencing two different errors here.
First problem: When scanning a file from a printer's feeder straight into PDF we get the following error:
Error: Invalid xref: xref expected but not found
    at Function.parseXrefObject (...\node_modules\pdfjs\lib\object\xref.js:110:13)
    at Function.parse (...\node_modules\pdfjs\lib\object\xref.js:67:19)
    at Parser.parse (...\node_modules\pdfjs\lib\parser\parser.js:45:29)
    at new ExternalDocument (...\node_modules\pdfjs\lib\external.js:17:12)
    at Object.exports.assembleAppendToForm (...\assembler.js:171:36)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async Object.exports.handler (...\app.js:46:9)
Here is an example file that is having this problem: Scan_20200512 (2).pdf
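To narrow down the first problem, one thing that could be checked is whether the offset recorded after the last `startxref` keyword actually lands on an `xref` table (or on a cross-reference stream object). The following is only a diagnostic sketch on a raw Node `Buffer`; `checkStartXref` is a hypothetical helper, it does not use any pdfjs API, and it is an assumption that this is what goes wrong in the attached scan.

```javascript
// Diagnostic sketch (hypothetical helper, not part of pdfjs):
// does the offset after the last `startxref` point at cross-reference data?
function checkStartXref(buf) {
  // latin1 maps each byte to one character, so string indices equal byte offsets.
  const text = buf.toString('latin1');

  const pos = text.lastIndexOf('startxref');
  if (pos === -1) return { ok: false, reason: 'no startxref keyword' };

  const match = /^startxref\s+(\d+)/.exec(text.slice(pos));
  if (!match) return { ok: false, reason: 'no offset after startxref' };

  const offset = Number(match[1]);
  if (offset >= text.length) {
    return { ok: false, offset, reason: 'offset beyond end of file' };
  }

  // Classic cross-reference table.
  if (text.startsWith('xref', offset)) return { ok: true, offset, kind: 'table' };

  // Files may instead keep the cross-reference data in a stream object ("N G obj").
  if (/^\d+\s+\d+\s+obj/.test(text.slice(offset, offset + 32))) {
    return { ok: true, offset, kind: 'stream' };
  }

  return { ok: false, offset, reason: 'neither xref table nor object at offset' };
}
```

It could be run as `checkStartXref(fs.readFileSync(path))` on the scanned file; an `ok: false` result would suggest the recorded offset is off, which some readers tolerate by searching for the table instead.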
Second problem: I'm not sure yet how to reproduce this problem (I will update the ticket as soon as I figure this out), but I did some digging and the issue appears to lie in external.js at line 20, where the pages object is set:
const pages = catalog.get('Pages').object.properties
Error message at this line:
TypeError: Cannot read property 'object' of undefined
    at new ExternalDocument (...\node_modules\pdfjs\lib\external.js:20:42)
    at Object.exports.assembleAppendToForm (...\assembler.js:171:36)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async Object.exports.handler (...\app.js:46:9)
While investigating I noticed that, in this particular instance, the Pages object is actually residing in
parser.trailer.get("Info").object.properties.get("Pages").object.properties
I wanted to issue a pull request with this particular fix, but while testing it I noticed that this new "Pages" object doesn't have all the properties that the catalog's one has. In particular it lacks the "Kids" property, which seems to be used quite a bit, and I don't see another way to substitute this value, so the change had quite a knock-on effect. What is the purpose of Kids? Is it purely to get a page count at the end of the day?
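For context on the Kids question: as far as I understand the PDF page tree, a `Pages` node carries a `Count` entry for the number of leaf pages and a `Kids` array listing its children, each of which is either another `Pages` node or an actual `Page`. So `Kids` is how the pages themselves are found, not just how they are counted. A minimal sketch of that traversal follows; plain objects stand in for parsed PDF dictionaries, and `collectPages` is a hypothetical name, not a pdfjs function.

```javascript
// Plain objects stand in for parsed PDF dictionaries; these are not pdfjs types.
function collectPages(node, out = []) {
  if (node.Type === 'Page') {
    // Leaf node: an actual page.
    out.push(node);
  } else if (node.Type === 'Pages') {
    // Intermediate node: descend into every child listed in Kids.
    for (const kid of node.Kids || []) collectPages(kid, out);
  }
  return out;
}

// A small made-up tree: one page at the top level, two more in a nested Pages node.
const tree = {
  Type: 'Pages',
  Count: 3,
  Kids: [
    { Type: 'Page', name: 'p1' },
    {
      Type: 'Pages',
      Count: 2,
      Kids: [{ Type: 'Page', name: 'p2' }, { Type: 'Page', name: 'p3' }],
    },
  ],
};

const pages = collectPages(tree);
```

A node without `Kids` yields no pages in this sketch, which would be consistent with the knock-on effect described above.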
As things presently stand for this second issue the only file I have that is experiencing this issue is a production file with sensitive client information in it, so I'm unfortunately not at liberty to share this document here. However, as soon as I figure out how to reproduce this error I will provide a test-safe document to this issue if the problem hasn't been fixed yet.
Please let me know if you need anything more from me as this is quite a big problem for us in production at the moment and I'm willing to aid in resolving this issue in any way I can.
Edit: Added copies of error messages to the second problem
Hi
My client has given me permission to share the document we are having problems with in the aforementioned second scenario. Here it is: CCF12052020.pdf
Should be fixed with https://github.com/rkusa/pdfjs/issues/212
I am still looking into the issue of the other file.
Should be fixed, please try out version 2.3.7. Thanks for the report as well as for the initial investigation 👍
Please let me know if it is fixed.
Hi. Thanks for the work! I'm doing some quick preliminary tests and things are looking good so far.
However, I just noticed I sent you the wrong file for describing the 2nd problem. Sorry about that. Here is the correct file: Nuthurst_Stream_Cross_Parapet_Design_For_Approval_V01.pdf
@marcodafonseca Do you know which program creates those PDFs? Because it apparently has quite a severe syntax error:
% java -jar test/preflight-app-2.0.19.jar Nuthurst_Stream_Cross_Parapet_Design_For_Approval_V01.pdf
The file Nuthurst_Stream_Cross_Parapet_Design_For_Approval_V01.pdf is not a valid PDF/A-1b file, error(s) :
1.1 : Header Syntax error, Second line must begin with '%' followed by at least 4 bytes greater than 127
1.0 : Syntax error, XREF for 11:0 points to wrong object: 10:0
While most PDF readers seem to notice and fix the xref pointing to the wrong object, I never planned to make pdfjs that permissive towards PDF syntax errors. So I am afraid I won't be investing time in this case myself, sorry. I might consider accepting a PR, though, if the fix isn't too invasive given that this is a PDF syntax error.
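To make the second preflight message concrete: an xref entry records a byte offset for an object, and the object found at that offset should start with a matching `N G obj` header. Below is a minimal sketch of that comparison on a raw `Buffer`; `xrefEntryMatches` is a hypothetical helper for illustration and is not how pdfjs or the preflight tool implement the check.

```javascript
// Sketch (hypothetical helper): does the byte offset recorded for object
// `num gen` actually land on a "num gen obj" header?
function xrefEntryMatches(buf, offset, num, gen) {
  // Read a short window at the offset; latin1 keeps one character per byte.
  const head = buf.toString('latin1', offset, offset + 32);
  const m = /^(\d+)\s+(\d+)\s+obj/.exec(head);
  if (!m) return { ok: false, found: null };

  // Report what was found as "num:gen", mirroring the preflight output format.
  const found = `${m[1]}:${m[2]}`;
  return { ok: Number(m[1]) === num && Number(m[2]) === gen, found };
}
```

An entry for 11:0 whose offset lands on `10 0 obj` would come back as `{ ok: false, found: '10:0' }`. A permissive reader could then scan the file for the right header, which is the kind of repair discussed above.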
Do you know which program creates those PDFs?
I'm not sure, hey. To my knowledge the file was potentially scanned using a flatbed scanner of some kind.
Thanks for looking into this, though.
I'm going to try my hand at putting together a fix that isn't too invasive
Please excuse my deleted comment. I had a mistake in how I was converting Buffers. This seems to be working as expected. Thanks!