v2: incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
Thanks for maintaining this library :heart:
I noticed that #7 helped to prompt a v2 release, and over in https://github.com/yob/pdf-reader/issues/538 I've had a suggestion to relax the pdf-reader dependencies to allow v2 to be used.
I gave it a go, but the CI build on ruby versions that installed Ascii85 v2 failed, for example: https://buildkite.com/yob-opensource/pdf-reader/builds/629#0191ac89-884b-4fc1-ace3-3f1a7b11258a
The input data was pulled from a test PDF and is hard to work with for a reproduction, so I trimmed the sample down and put together a short script:
# coding: utf-8
require "bundler/inline"
gemfile do
  source "https://rubygems.org"
  #gem 'Ascii85', '1.1.1'
  gem 'Ascii85', '2.0.0'
end
require 'ascii85'
data = %Q{<~8;Xu[gMYb*&H)\\/_`Xe]6AlP#",[.!-NsinYhkAeQYmmYS$ojXOU~>}
puts data
puts "*****************************"
puts "input utf8"
puts "*****************************"
puts data.encoding
puts data.valid_encoding?
res = Ascii85.decode(data)
puts res.inspect
Flipping the Ascii85 version between v1 and v2 shows the difference: the input data decodes fine on v1.1.1 and raises an exception on v2.0.0:
$ ruby repro.rb
<~8;Xu[gMYb*&H)\/_`Xe]6AlP#",[.!-NsinYhkAeQYmmYS$ojXOU~>
*****************************
input utf8
*****************************
UTF-8
true
/home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:345:in `write': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
from /home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:298:in `decode_raw'
from /home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:192:in `decode'
from repro.rb:24:in `<main>'
The output data is expected to be binary and not valid UTF-8. I assume I might be able to work around it by using the new v2 API to pass in a binary-encoded output buffer; however, pdf-reader still supports rubies < 2.7, so I'm aiming to use the v1-compatible parts of Ascii85's API.
Thank you for such an outstanding bug report! The script made it really easy to reproduce the problem.
I was too confident that my specs would protect me from this kind of issue, but it seems I have not tested binary data sufficiently.
I managed to distill your example down into a short string that triggers the issue, <~S$ojXOT~> (OU and OT are equivalent because the last bits get chopped off, but the gem produces OT when encoding the data), but alas I have now run out of time for today.
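To illustrate why OU and OT decode to the same data, here is a minimal sketch of how a trailing partial Ascii85 group is decoded. It is not the gem's implementation; `decode_partial_group` is a hypothetical helper written only for this example. A two-character group carries a single byte, so the low bits of the second character are discarded after padding.

```ruby
# Decode one trailing Ascii85 group of 2..4 characters (without the <~ ~> delimiters).
# Hypothetical helper for illustration; this is not the Ascii85 gem's code.
def decode_partial_group(chars)
  missing = 5 - chars.length
  padded  = chars + "u" * missing   # pad with 'u', the highest digit (84)
  # Interpret the five characters as a base-85 number (each digit is byte - 33).
  value = padded.each_byte.reduce(0) { |acc, b| acc * 85 + (b - 33) }
  # Keep only the leading (4 - missing) bytes; the padded tail is thrown away.
  [value].pack("N").byteslice(0, 4 - missing)
end

p decode_partial_group("OT").bytes                          # => [145]
p decode_partial_group("OT") == decode_partial_group("OU")  # => true
```

The single resulting byte is 0x91, which is not valid on its own in UTF-8, so this short input is enough to exercise the binary code path.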
I think the issue can mostly be solved by spamming force_encoding(Encoding::ASCII_8BIT) throughout the code, to make sure that the gem always uses the BINARY encoding instead of the default UTF-8, but that is a rather ugly solution.
Still, I will shortly push a commit doing just that, and it at least makes the example pass -- but I'm not 100% sure if I managed to catch every instance of the problem.
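For readers unfamiliar with this class of error, the following sketch shows the general Ruby rule behind it and how force_encoding avoids it. It does not reproduce the gem's exact code path; `concat_error_class` and `as_binary` are hypothetical helpers for this example. Combining a UTF-8 string that contains non-ASCII characters with an ASCII-8BIT string that contains high bytes raises Encoding::CompatibilityError, while tagging both sides as binary does not.

```ruby
# Returns the exception class raised by left + right, or nil if none is raised.
def concat_error_class(left, right)
  left + right
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# Hypothetical helper: a copy of the same bytes, tagged as binary.
def as_binary(str)
  str.dup.force_encoding(Encoding::ASCII_8BIT)
end

utf8_chunk   = "\u00FC"          # UTF-8 string with a non-ASCII character
binary_chunk = [0x91].pack("C")  # ASCII-8BIT string with a high byte

p concat_error_class(utf8_chunk, binary_chunk)             # => Encoding::CompatibilityError
p concat_error_class(as_binary(utf8_chunk), binary_chunk)  # => nil

combined = as_binary(utf8_chunk) + binary_chunk
p combined.encoding == Encoding::ASCII_8BIT                # => true
```

Note that force_encoding only changes the encoding tag; the underlying bytes are left untouched, which is what a binary decoder wants.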
I probably won't be able to work on this again before Wednesday; I'll try to see if I can uncover more edge cases that can lead to problems then.
Sounds good. There's no urgency from my perspective, released versions of pdf-reader are locked to v1.x so they're continuing to work fine.
I can see a fix has been pushed to main, so I gave it a go (https://github.com/yob/pdf-reader/compare/ascii85-2-0?expand=1). The pdf-reader spec suite is green (some jobs failed, but for unrelated reasons). Here's a passing example on ruby 3.3: https://buildkite.com/yob-opensource/pdf-reader/builds/630#0191b783-766d-4fa6-b7e4-b9583a832f1e
Thank you for testing the changes!
I went through the code again on the weekend and made sure that all String literals are unfrozen and encoded as ASCII_8BIT before use; that should take care of the encoding errors.
The new version has also managed to correctly encode and then decode a few gigabytes of random binary data without raising an Exception, so I hope that it works properly now.
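A round-trip check of this kind could look roughly like the sketch below. It is not the script that was actually used; `round_trip_failures` is a hypothetical helper, and the Base64 lambdas are stand-ins so that the example runs without installing any gem.

```ruby
# Encode random binary data, decode it again, and collect every input that
# fails to survive the round trip or raises along the way.
# `encode` and `decode` can be any callables.
def round_trip_failures(encode, decode, rounds: 100, max_len: 64, seed: 42)
  rng = Random.new(seed)
  failures = []
  rounds.times do
    data = rng.bytes(rng.rand(0..max_len))  # random binary string, possibly empty
    begin
      decoded = decode.call(encode.call(data))
      failures << data unless decoded.b == data.b  # compare raw bytes
    rescue StandardError  # includes Encoding::CompatibilityError
      failures << data
    end
  end
  failures
end

# Stand-in codec built on core pack/unpack (strict Base64), an assumption for this sketch.
BASE64_ENCODE = ->(bytes) { [bytes].pack("m0") }
BASE64_DECODE = ->(text)  { text.unpack1("m0") }

puts round_trip_failures(BASE64_ENCODE, BASE64_DECODE).length  # => 0
```

With the gem installed, passing Ascii85.method(:encode) and Ascii85.method(:decode) in place of the stand-ins should exercise the same check against Ascii85 itself.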
Unless something else crops up, I'll probably release version 2.0.1 this weekend.
Thanks for your help here! I've released pdf-reader with a relaxed Ascii85 dependency and all our tests are green :heart: