pdfrw corrupts some files produced by Microsoft Word

Open AlisterH opened this issue 6 years ago • 3 comments

Hi, certain PDFs created by Microsoft Word seem to cause a problem with pdfrw. I have attached some example files:

page_break.pdf is created in Microsoft Word by inserting a page break and doing "save as" to pdf. If I use the example watermark.py script to add a watermark (e.g. Test.pdf) then the resulting file is good.

section_break.pdf is created in Microsoft Word by inserting a "section break (next page)" and doing "save as" to pdf. If I apply watermark.py to this file then the resulting file is bad: when it is opened in Adobe Reader, displaying page 1 gives the message "An error exists on this page. Acrobat may not display the page correctly. Please contact the person who created the PDF document to correct the problem.". Some other pdf software also thinks there is an error (Pdf-Xchange actually thinks it can save a fixed copy, but the saved copy turns out not to be fixed).

It is tempting to just blame Microsoft Word for creating unusual pdfs, but it would be nice if this is something that could be fixed in pdfrw or worked around.

AlisterH, Nov 02 '18 02:11

If it is useful to anybody, there are some workarounds:

  • process the corrupted pdf with either pdftocairo (1) or gs (2)
  • process the original input pdf with pdftocairo (3) or gs (4), before processing it with pdfrw
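For anyone wanting to script these workarounds, here is a minimal sketch in Python. The exact commands used for (1) to (4) are not reproduced here, so the invocations below are an assumption: they are the ordinary "rewrite a pdf" forms of `pdftocairo -pdf` and `gs -sDEVICE=pdfwrite`. The helper names (`pdftocairo_cmd`, `gs_cmd`, `rewrite_pdf`) are made up for illustration.

```python
import shutil
import subprocess


def pdftocairo_cmd(src, dst):
    """Build a pdftocairo command that re-renders src as a new pdf at dst."""
    # -pdf selects pdf output (pdftocairo can also write png, svg, ps, ...)
    return ["pdftocairo", "-pdf", str(src), str(dst)]


def gs_cmd(src, dst):
    """Build a Ghostscript command that rewrites src through the pdfwrite device."""
    # -o sets the output file and also implies -dBATCH -dNOPAUSE
    return ["gs", "-o", str(dst), "-sDEVICE=pdfwrite", str(src)]


def rewrite_pdf(src, dst, tool="pdftocairo"):
    """Rewrite a pdf with the chosen external tool (assumed to be on PATH)."""
    builders = {"pdftocairo": pdftocairo_cmd, "gs": gs_cmd}
    cmd = builders[tool](src, dst)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} is not on PATH")
    # check=True raises CalledProcessError if the tool exits with an error
    subprocess.run(cmd, check=True)
    return dst
```

Usage would be something like `rewrite_pdf("section_break.pdf", "section_break_clean.pdf", tool="gs")` before handing the cleaned file to pdfrw, or the same call on pdfrw's output afterwards.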

I'm not sure what the best way might be to detect problematic files if that needs to be done, but I do notice this message when doing (1):

Syntax Error: XObject 'pdfrw_0' is unknown
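One possible way to use that message for detection, sketched in Python. This is only an assumption about how a check might look, not something tested across poppler versions: it runs `pdftocairo` on a file that pdfrw has already written and scans stderr for the message quoted above. Note that it flags a corrupted output file, not a problematic input. The function names and the regular expression are hypothetical.

```python
import re
import subprocess

# Matches the message as quoted above; other poppler versions may word it differently.
UNKNOWN_XOBJECT = re.compile(r"Syntax Error: XObject '([^']+)' is unknown")


def unknown_xobjects(stderr_text):
    """Return the XObject names reported as unknown in pdftocairo's stderr text."""
    return UNKNOWN_XOBJECT.findall(stderr_text)


def check_with_pdftocairo(src, scratch_out):
    """Run pdftocairo on src, writing to scratch_out, and report unknown XObjects."""
    result = subprocess.run(
        ["pdftocairo", "-pdf", str(src), str(scratch_out)],
        capture_output=True,
        text=True,
    )
    return unknown_xobjects(result.stderr)
```

An empty list from `check_with_pdftocairo` would only mean this particular message was not printed; it does not prove the file is fine.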

I haven't really compared the different outputs, but testing other large pdfs (that haven't been processed by pdfrw) I see gs also sometimes gives much smaller files.

AlisterH, Nov 08 '18 05:11

section_break.pdf was created by Word 2016. The problem also occurs with files created by Word 2010, e.g. Doc1.pdf

AlisterH, Nov 14 '18 20:11

I've just tested with pdf files created by Word 2007, and it affects them too.

AlisterH, Nov 19 '18 08:11