qubes-issues
Convert to Safe PDF crashes when "fancy" PDFs are tried
The problem you're addressing (if any)
In a recent user research interview, a user relayed to me that PDFs they had been sent by unknown 3rd parties (so, precisely what this tool was built to process) do not work with this tool and effectively crash it when those PDFs are "too fancy." When I asked what that meant, exactly, the user cited the plethora of advanced PDF features Adobe is able to embed in PDFs created with its own tooling. Other apps may also leverage those advanced capabilities built and licensed by Adobe; I do not know.
As a former Adobe employee, I do recall that they have major contracts with state defense departments and multiple other B2B customers to help them embed interactivity, signature collection, and possibly embedded videos (and probably interactive and tracking stuff I never noticed before) in PDFs generated by their tools.
Describe the solution you'd like
Talented developers looking at the code and fixing it up such that even the most bloated and ridiculous PDF fails to crash this tool. Regrettably, my own talents fall short of that.
Where is the value to a user, and who might that user be?
All users: sparing folks the "FAFO" trial-and-error disappointment of a tool not delivering on its implied promise.
Non-Linux-native users, and newbies to Libre stuff: a better experience for users unaccustomed to reading documentation before diving into using a tool or utility.
Describe alternatives you've considered
Currently this user observes that Convert To Trusted PDF hangs or crashes. His workaround is to "Print to PDF" from a viewer app in a dvm, and then run that output through Convert To Trusted PDF.
At a minimum, it'd be nice if the utility could detect such problematic attributes in PDFs in a pre-flight examination of the document, and then message users with this proposed workaround before attempting to process the PDF. Happy to work with anyone interested in taking this on, to suggest user-facing things.
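To make the pre-flight idea concrete, here is a minimal sketch in Python. It is not taken from the converter's code. The marker list, the function names `preflight_scan` and `preflight_message`, and the message wording are all assumptions for illustration, and nothing in this thread confirms that any of these PDF features is what actually triggers the crash. A plain byte search like this also misses names stored inside compressed object streams, so it can only ever be a hint.

```python
from typing import List, Optional

# Hypothetical list of PDF name tokens that indicate "fancy" features.
# Which features (if any) actually upset the converter is unknown.
RISKY_MARKERS = {
    b"/JavaScript": "embedded JavaScript",
    b"/AcroForm": "interactive form (AcroForm)",
    b"/RichMedia": "rich media (video or Flash)",
    b"/EmbeddedFile": "embedded file attachment",
    b"/Launch": "launch action",
}


def preflight_scan(data: bytes) -> List[str]:
    """Return labels for every marker found in the raw PDF bytes."""
    return [label for marker, label in RISKY_MARKERS.items() if marker in data]


def preflight_message(filename: str, findings: List[str]) -> Optional[str]:
    """Build a user-facing hint, or return None when nothing was found."""
    if not findings:
        return None
    return (
        f"{filename} contains: {', '.join(findings)}. "
        "If conversion hangs or fails, try opening the file in a disposable VM, "
        "using Print to PDF, and converting that output instead."
    )
```

A caller would read the file's bytes inside the disposable VM, pass them to `preflight_scan`, and show the result of `preflight_message` before starting the conversion.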
Additional context
For Reasons, I regrettably cannot publicly share the PDFs that caused this, but I have asked the user to keep in touch with me as they are able to, so that I can learn more. I will share those insights here, should they happen.
Related, non-duplicate issues
#6181 also proposes an unrelated enhancement to the Trusted PDF converter.
An obvious solution is to switch from ImageMagick to pdf.js or PDFium. This is a good idea for numerous reasons, not least of which is security.
Thaaaat would help users who depend upon this feature, a great deal!
ImageMagick uses Ghostscript for PDF processing, but Ghostscript is not a very good PDF renderer. In fact, it has proven to be so full of security holes that it is often (rightly!) disabled in the ImageMagick security policy. Furthermore, Ghostscript is AGPLv3, which a lot of companies really do not like.
The safest choice is without question Mozilla’s pdf.js. It is written in a memory-safe language (JavaScript) and runs within the browser sandbox, providing an extra layer of isolation. It has been built into Firefox for many years and is considered production quality. The main drawback is that it is not easy to embed. Spawning a full web browser is expensive, and would require additional work to dump the rendered PDF to a bitmap. Node.js is supported, but seems to have bugs in the emulation of browser APIs that cause problems. Furthermore, I expect that pdf.js probably uses lots of memory, although I have not measured it.
Another option is PDFium, which is used by Chromium. PDFium is a C++11 library with a C API that should (famous last words) be fairly easy to embed in a command-line tool. It also undergoes regular security reviews by Google. The main drawback is that it is part of Chromium and so not easy to build. Qubes OS would likely need to rip it out of a Chromium SRPM.
@DemiMarie You are making me feel much better about having opened this issue, I'd had no idea about any of that! This is all terrific to learn.
On Tue, Jun 08, 2021 at 03:50:00PM -0700, Demi Marie Obenour wrote:
[...]
To aid in testing could you provide some examples of "fancy" PDFs.
To be clear: the architecture of the PDF converter is specifically designed so that we do not have to worry about the security of the rendering process. The main factor when choosing a renderer should be its accuracy, not necessarily its security. If we can have both, then fine, but we don't need to compromise on accuracy to get better security. I say this because @DemiMarie listed almost solely security properties of those renderers, which are not that important a factor when choosing one here.
Good point. PDFium and Poppler are probably the best choices. PDFium is probably a better renderer, but Poppler is probably easier to integrate.
@unman I'd be happy to cobble some together, but the exact ones my user cited, I regrettably cannot obtain. I will follow up with them to get more detail, and will post some links to fancy PDFs here as soon as I'm able to (but likely not for at least another week).
I'm not sure if Dangerzone will fail on the same PDF or not, but that tool uses Poppler (the pdftocairo command specifically) to convert single-page PDFs to PNG files. Here's the relevant file: https://github.com/firstlookmedia/dangerzone-converter/blob/stable/scripts/document-to-pixels-unpriv
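For readers unfamiliar with that step, rendering one page of a PDF to a PNG with Poppler's `pdftocairo` can be sketched as below. This is only an illustration, not the code from Dangerzone or from the Qubes converter. The helper name `pdftocairo_page_cmd` and the 150 DPI default are assumptions; the flags themselves (`-png`, `-r`, `-f`, `-l`, `-singlefile`) are standard `pdftocairo` options.

```python
from typing import List


def pdftocairo_page_cmd(pdf_path: str, page: int, out_prefix: str, dpi: int = 150) -> List[str]:
    """Build a pdftocairo command that renders a single page to <out_prefix>.png."""
    if page < 1:
        raise ValueError("PDF page numbers start at 1")
    return [
        "pdftocairo",
        "-png",             # emit a PNG image
        "-r", str(dpi),     # resolution in DPI (assumed default, tune as needed)
        "-f", str(page),    # first page to render
        "-l", str(page),    # last page to render (same page, so exactly one)
        "-singlefile",      # write out_prefix.png without a page-number suffix
        pdf_path,
        out_prefix,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True, timeout=...)`, so a page that hangs the renderer is reported as a failure rather than stalling the whole conversion.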
I also learned from the mentioned user, that it was not Acrobat™ or other Adobe tooling-generated PDFs that were "fancy," but PowerPoint generated PDFs that created the crash. @unman @DemiMarie
pdftocairo is a much better idea than Ghostscript. Poppler powers the Evince reader, so it should be reasonably good.
pdftocairo is a much better idea than Ghostscript
This is also what we use: https://github.com/QubesOS/qubes-app-linux-pdf-converter/blob/master/qubespdfconverter/server.py#L141-L153
I think you got confused with ImageMagick, as it's used to convert pdftocairo's output to an RGB data stream.
In that case PDFium is probably the best choice.
We might want to replace ImageMagick for performance reasons.
What would ImageMagick be replaced with?
libvips? libpng?
On Thu, Jun 10, 2021 at 11:56:51PM -0700, Nina Eleanor Alter wrote:
I also learned from the mentioned user, that it was not Acrobat™ or other Adobe tooling-generated PDFs that were "fancy," but PowerPoint generated PDFs that created the crash. @unman @DemiMarie
I have been experimenting with PDFs generated from PowerPoint, and still can't hit this problem, even with documents of more than 50 pages.
Is it not possible for the reporter to generate some document (using stock images and stock text) that will cause the crash? If not, can they provide much more information about the PDF (its form, not its content), the PowerPoint version, etc.? If not, can someone else generate a PDF (possibly from PowerPoint) that will crash the converter, and provide a sample?
It would be good if we curated a test suite of documents - I have some sample PDFs and images, but they don't cause issues.
@unman I don't think it's the volume of pages in a PPT, but rather a bunch of transitions, animations, and other effects. Have you tested with any of those? I can help by looking for some as a background activity to watching movies.
On Sun, Jul 04, 2021 at 12:32:56PM -0700, Nina Eleanor Alter wrote:
@unman I don't think it's the volume of pages in a PPT, but rather a bunch of transitions, animations, and other effects. Have you tested with any of those? I can help by looking for some as a background activity to watching movies.
I've tried with animations and some other effects, but don't have time to experiment in the dark. It would be much easier if the original reporter could provide a sample or detail on the content.
Not the original reporter, but I've been experiencing this issue with PDF books. I've attached a sample PDF. The following is the output of qvm-convert-pdf Introduction to[...]:
Sending file to Disposable VM...
Introduction_to_algorithms-3rd Edition.pdf...50/1313
Total Sanitized Files: 0/1
Introduction_to_algorithms-3rd Edition.pdf
Hope this is helpful.
I think the best option is to replace pdftocairo with either PDFium or with PDF.js.
On Fri, Jan 14, 2022 at 05:08:09PM -0800, Jordan S. wrote:
[...]
Thank you - a sample at last.
This appears to have been diagnosed by @tungsten987 in #7759.
Today I learned how to create Pull Requests so I finally took the time to create one for my proposed fix to this issue (see above). I've been using the fix myself for months for a lot of PDF files and it works great. The reasoning for the fix is explained in #7759.
Closing as completed. If anyone believes this issue is not yet completed, or if anyone is still affected by this issue, please leave a comment, and we'll be happy to reopen it. Thank you.