prawn icon indicating copy to clipboard operation
prawn copied to clipboard

Data corruption with encrypted PDFs and inline-formatted links

Open seelensonne opened this issue 9 years ago • 13 comments

We’re encountered a strange, non-deterministic but reproducible data corruption issue, which seems to occur only under the following circumstances:

  • PDF is encrypted via encrypt_document(owner_password: …), and
  • PDF contains a link, e.g. text “<link …”, inline_format: true).

Further down is a simple test script which generates a very basic PDF such as:

Prawn::Document.generate("sample.pdf") do
  text %{<link href="https://example.com/">Hello!</link>}, inline_format: true
  encrypt_document owner_password: (1..32).map{ rand(256) }.pack("c*")
end

The test script then calls the qpdf command line tool to convert the generated PDF to a plaintext-ish version in order to assert the URL of the link.

Test Case

Steps

  1. Install the qpdf command line tool, e.g. via brew install qpdf (OS X) or yum install -y qpdf (CentOS).
  2. Install the latest Prawn via gem install prawn --version 2.1.0.
  3. Save the script below as test.rb.
  4. Run the script via ruby test.rb 100.

We tested with Prawn 2.1.0 under OS X 10.11.4 + Ruby 2.2.4p230 and CentOS 7.2.1511 + Ruby 2.0.0p598.

Expected Output

  • 100 lines which all look like this: tmp/0001-prawn.pdf => OK …
  • 100 PDF files in tmp/*-prawn.pdf which all contain a clickable link to https://example.com/.

Actual Output

  • Several lines will look like this: tmp/0001-prawn.pdf => BAD ….
  • The bad PDFs will show the text Hello! just fine, but the link will be corrupt, as the URL contains some seemingly random garbage.

If you don’t get any bad PDFs, run the script again or increase the number of iterations, e.g. ruby test.rb 1000.

Notes

  • The test script generates random passwords using code borrowed from lib/prawn/security.rb… but the issue also occurs if much simpler passwords are used (e.g. SecureRandom.hex(16)).
  • The issue also occurs if each run only generates a single PDF, e.g. if you run the script via for i in {1..100} ; do ruby test.rb 1 ; done.
  • Some PDF readers (e.g. Preview on OS X) don’t do a good job of showing the actual URL of a link. The easiest way to clearly show the corrupt URL is to open a bad tmp/*-prawn.pdf PDF in Firefox and then mouse-over the Hello! text.

Test Script

require "fileutils"
require "securerandom"
require "prawn" # 2.1.0

samples = (ARGV.first.to_s =~ /\A\d+\z/) ? ARGV.first.to_i : 20
url = "https://example.com/"

FileUtils.mkdir("tmp") unless File.directory?("tmp")

# Passwords are generated with the same method Prawn uses for `owner_password:
# :random` (https://goo.gl/k6Ajbn), but the issue also occurs with much simpler
# passwords, e.g. `SecureRandom.hex(16)`.
owner_passwords = samples.times.collect { (1..32).map{ rand(256) }.pack("c*") } # BAD
# owner_passwords = samples.times.collect { SecureRandom.hex(16) } # BAD
# owner_passwords = samples.times.collect { "123456789-123456789-123456789-12" } # BAD
# owner_passwords = samples.times.collect { "123456789-123456789-123456789-1" } # OK
# owner_passwords = samples.times.collect { "abcdefghijklmnopqrstuvwxyz012312" } # OK

owner_passwords.each_with_index do |owner_password, i|
  pdf_path = File.join("tmp", "#{sprintf("%04d", i)}-prawn.pdf")
  qdf_path = File.join("tmp", "#{sprintf("%04d", i)}-qdf.pdf")

  # Generate a very basic PDF
  print "#{pdf_path}"
  Prawn::Document.generate(pdf_path) do
    text %{<link href="#{url}">Hello!</link>}, inline_format: true
    encrypt_document owner_password: owner_password
  end

  # Call `qpdf` (http://qpdf.sourceforge.net/) so we have a plaintext-ish
  # version of the PDF to assert the URL of the link
  `qpdf --qdf #{pdf_path} #{qdf_path} > /dev/null 2>&1`
  expected = "/URI (#{url})"
  qdf = File.read(qdf_path)
  print " => #{qdf.include?(expected) ? "OK " : "BAD"} owner_password: #{owner_password.inspect}"
  puts
end

seelensonne avatar Apr 14 '16 05:04 seelensonne

@seelensonne Thank you for detailed report. We will take a look as soon as we will have time on our hands.

pointlessone avatar Apr 14 '16 07:04 pointlessone

@gettalong Do you think this is the same issue that you fixed with #1242 ?

petergoldstein avatar Feb 17 '22 20:02 petergoldstein

@petergoldstein Yes, most likely, but I haven't run the provided test script against master.

gettalong avatar Feb 17 '22 20:02 gettalong

@gettalong Ok. Let me take a look then and see if the script works against current master

petergoldstein avatar Feb 17 '22 20:02 petergoldstein

So the script still fails against current master, but the commonality appears to be that the encrypted string contains an \r. So something still seems to be wrong with the encoding against that case. Here are a few encrypted versions of the URIs that fail/succeed:

Fail:

  • \x8B\x10\x13\xF86D\r\x16y\xCDt\xD6om4x\xC5\xD0\xF3w
  • \f{\rdz\x86\xBCJt\xE9t\xEF\xB0\xC9}\xD14\xAC8\xDE

Succeed:

  • ;\a\xB2\xDA'\xB6\xED\x05\xC8H]\xE0:\x808\x10\xF1\xE7\x96`
  • \xE0\xA6\x15\xCA\xA4\xF0\xCA\xA1\xCA}\xFFt\x81+\xEF\xB1\xB1-\v\xF1

The presence or absence of a \r appears to be the determining factor.

petergoldstein avatar Feb 17 '22 21:02 petergoldstein

Hmm... the failing string \f{\rdz\x86\xBCJt\xE9t\xEF\xB0\xC9}\xD14\xAC8\xDE should not contain an unescaped \r with the fixed version, \r should always be preceded by \. So somehow this is not done. Is the line https://github.com/prawnpdf/prawn/pull/1242/files#diff-63c439227654808d583cba1f90c452658f253b8a680a4545db1a1498583a863aR232 really hit? Because this should be the place where \r gets turned into \\r

gettalong avatar Feb 17 '22 22:02 gettalong

So these are being captured in code before the escape.

I've confirmed that the escape is getting called, but something is still wrong in the process.

petergoldstein avatar Feb 17 '22 22:02 petergoldstein

Okay, I will have a look at it. Could you provide a sample Prawn::Document.generate ... script that produces a bad PDF?

gettalong avatar Feb 17 '22 23:02 gettalong

Here's a script I can run from the root of my Git repo (current master):

require './lib/prawn'
require "fileutils"
require "securerandom"

url = "https://example.com/"

FileUtils.mkdir("tmp") unless File.directory?("tmp")


owner_password = "2332edc14c879ee5cf9c5e2f9892eb54"
pdf_path = File.join("tmp", "corrupt-prawn.pdf")
Prawn::Document.generate(pdf_path) do
  text %{<link href="#{url}">Hello!</link>}, inline_format: true
  encrypt_document owner_password: owner_password
end

owner_password = "abcd"
pdf_path = File.join("tmp", "valid-prawn.pdf")
Prawn::Document.generate(pdf_path) do
  text %{<link href="#{url}">Hello!</link>}, inline_format: true
  encrypt_document owner_password: owner_password
end

which produces these two files: corrupt-prawn.pdf valid-prawn.pdf

The user password field is empty, so you don't actually need a password to open. You can see the link in valid-prawn.pdf is fine, while the one in corrupt-prawn.pdf is ... corrupt.

petergoldstein avatar Feb 18 '22 00:02 petergoldstein

@petergoldstein Thanks for the test case - I think I found the issue and will provide a fix. :confused: :unamused: Backslash escapes are always a bit hard to think through and my previous fix was not the correct one.

gettalong avatar Feb 18 '22 08:02 gettalong

@petergoldstein @pointlessone Fixes are here for pdf-core and here for Prawn. The one in Prawn depends on the one in pdf-core.

gettalong avatar Feb 18 '22 12:02 gettalong

@gettalong If you haven't already I'd recommend yo run the original script that @seelensonne provided in this ticket to verify the changes. Running it with a sufficiently high sample count should probe the entire range of characters and make sure that we're encoding correctly. The script was very helpful in finding failing cases.

petergoldstein avatar Feb 19 '22 15:02 petergoldstein

@petergoldstein Yeah, all cases come back OK. Ran it with 1000 iterations which should be enough.

gettalong avatar Feb 19 '22 18:02 gettalong

The fix has been merged. Thank you.

pointlessone avatar Mar 04 '24 10:03 pointlessone