Data corruption with encrypted PDFs and inline-formatted links
We’re encountered a strange, non-deterministic but reproducible data corruption issue, which seems to occur only under the following circumstances:
- PDF is encrypted via
encrypt_document(owner_password: …), and - PDF contains a link, e.g.
text “<link …”, inline_format: true).
Further down is a simple test script which generates a very basic PDF such as:
Prawn::Document.generate("sample.pdf") do
text %{<link href="https://example.com/">Hello!</link>}, inline_format: true
encrypt_document owner_password: (1..32).map{ rand(256) }.pack("c*")
end
The test script then calls the qpdf command line tool to convert the generated PDF to a plaintext-ish version in order to assert the URL of the link.
Test Case
Steps
- Install the qpdf command line tool, e.g. via
brew install qpdf(OS X) oryum install -y qpdf(CentOS). - Install the latest Prawn via
gem install prawn --version 2.1.0. - Save the script below as
test.rb. - Run the script via
ruby test.rb 100.
We tested with Prawn 2.1.0 under OS X 10.11.4 + Ruby 2.2.4p230 and CentOS 7.2.1511 + Ruby 2.0.0p598.
Expected Output
- 100 lines which all look like this:
tmp/0001-prawn.pdf => OK … - 100 PDF files in
tmp/*-prawn.pdfwhich all contain a clickable link to https://example.com/.
Actual Output
- Several lines will look like this:
tmp/0001-prawn.pdf => BAD …. - The bad PDFs will show the text
Hello!just fine, but the link will be corrupt, as the URL contains some seemingly random garbage.
If you don’t get any bad PDFs, run the script again or increase the number of iterations, e.g. ruby test.rb 1000.
Notes
- The test script generates random passwords using code borrowed from lib/prawn/security.rb… but the issue also occurs if much simpler passwords are used (e.g.
SecureRandom.hex(16)). - The issue also occurs if each run only generates a single PDF, e.g. if you run the script via
for i in {1..100} ; do ruby test.rb 1 ; done. - Some PDF readers (e.g. Preview on OS X) don’t do a good job of showing the actual URL of a link. The easiest way to clearly show the corrupt URL is to open a bad
tmp/*-prawn.pdfPDF in Firefox and then mouse-over theHello!text.
Test Script
require "fileutils"
require "securerandom"
require "prawn" # 2.1.0
samples = (ARGV.first.to_s =~ /\A\d+\z/) ? ARGV.first.to_i : 20
url = "https://example.com/"
FileUtils.mkdir("tmp") unless File.directory?("tmp")
# Passwords are generated with the same method Prawn uses for `owner_password:
# :random` (https://goo.gl/k6Ajbn), but the issue also occurs with much simpler
# passwords, e.g. `SecureRandom.hex(16)`.
owner_passwords = samples.times.collect { (1..32).map{ rand(256) }.pack("c*") } # BAD
# owner_passwords = samples.times.collect { SecureRandom.hex(16) } # BAD
# owner_passwords = samples.times.collect { "123456789-123456789-123456789-12" } # BAD
# owner_passwords = samples.times.collect { "123456789-123456789-123456789-1" } # OK
# owner_passwords = samples.times.collect { "abcdefghijklmnopqrstuvwxyz012312" } # OK
owner_passwords.each_with_index do |owner_password, i|
pdf_path = File.join("tmp", "#{sprintf("%04d", i)}-prawn.pdf")
qdf_path = File.join("tmp", "#{sprintf("%04d", i)}-qdf.pdf")
# Generate a very basic PDF
print "#{pdf_path}"
Prawn::Document.generate(pdf_path) do
text %{<link href="#{url}">Hello!</link>}, inline_format: true
encrypt_document owner_password: owner_password
end
# Call `qpdf` (http://qpdf.sourceforge.net/) so we have a plaintext-ish
# version of the PDF to assert the URL of the link
`qpdf --qdf #{pdf_path} #{qdf_path} > /dev/null 2>&1`
expected = "/URI (#{url})"
qdf = File.read(qdf_path)
print " => #{qdf.include?(expected) ? "OK " : "BAD"} owner_password: #{owner_password.inspect}"
puts
end
@seelensonne Thank you for detailed report. We will take a look as soon as we will have time on our hands.
@gettalong Do you think this is the same issue that you fixed with #1242 ?
@petergoldstein Yes, most likely, but I haven't run the provided test script against master.
@gettalong Ok. Let me take a look then and see if the script works against current master
So the script still fails against current master, but the commonality appears to be that the encrypted string contains an \r. So something still seems to be wrong with the encoding against that case. Here are a few encrypted versions of the URIs that fail/succeed:
Fail:
\x8B\x10\x13\xF86D\r\x16y\xCDt\xD6om4x\xC5\xD0\xF3w\f{\rdz\x86\xBCJt\xE9t\xEF\xB0\xC9}\xD14\xAC8\xDE
Succeed:
;\a\xB2\xDA'\xB6\xED\x05\xC8H]\xE0:\x808\x10\xF1\xE7\x96`\xE0\xA6\x15\xCA\xA4\xF0\xCA\xA1\xCA}\xFFt\x81+\xEF\xB1\xB1-\v\xF1
The presence or absence of a \r appears to be the determining factor.
Hmm... the failing string \f{\rdz\x86\xBCJt\xE9t\xEF\xB0\xC9}\xD14\xAC8\xDE should not contain an unescaped \r with the fixed version, \r should always be preceded by \. So somehow this is not done. Is the line https://github.com/prawnpdf/prawn/pull/1242/files#diff-63c439227654808d583cba1f90c452658f253b8a680a4545db1a1498583a863aR232 really hit? Because this should be the place where \r gets turned into \\r
So these are being captured in code before the escape.
I've confirmed that the escape is getting called, but something is still wrong in the process.
Okay, I will have a look at it. Could you provide a sample Prawn::Document.generate ... script that produces a bad PDF?
Here's a script I can run from the root of my Git repo (current master):
require './lib/prawn'
require "fileutils"
require "securerandom"
url = "https://example.com/"
FileUtils.mkdir("tmp") unless File.directory?("tmp")
owner_password = "2332edc14c879ee5cf9c5e2f9892eb54"
pdf_path = File.join("tmp", "corrupt-prawn.pdf")
Prawn::Document.generate(pdf_path) do
text %{<link href="#{url}">Hello!</link>}, inline_format: true
encrypt_document owner_password: owner_password
end
owner_password = "abcd"
pdf_path = File.join("tmp", "valid-prawn.pdf")
Prawn::Document.generate(pdf_path) do
text %{<link href="#{url}">Hello!</link>}, inline_format: true
encrypt_document owner_password: owner_password
end
which produces these two files: corrupt-prawn.pdf valid-prawn.pdf
The user password field is empty, so you don't actually need a password to open. You can see the link in valid-prawn.pdf is fine, while the one in corrupt-prawn.pdf is ... corrupt.
@petergoldstein Thanks for the test case - I think I found the issue and will provide a fix. :confused: :unamused: Backslash escapes are always a bit hard to think through and my previous fix was not the correct one.
@petergoldstein @pointlessone Fixes are here for pdf-core and here for Prawn. The one in Prawn depends on the one in pdf-core.
@gettalong If you haven't already I'd recommend yo run the original script that @seelensonne provided in this ticket to verify the changes. Running it with a sufficiently high sample count should probe the entire range of characters and make sure that we're encoding correctly. The script was very helpful in finding failing cases.
@petergoldstein Yeah, all cases come back OK. Ran it with 1000 iterations which should be enough.
The fix has been merged. Thank you.