v2: incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)

Open yob opened this issue 1 year ago • 3 comments

Thanks for maintaining this library :heart:

I noticed that #7 helped to prompt a v2 release, and over in https://github.com/yob/pdf-reader/issues/538 I've had a suggestion to relax the pdf-reader dependencies to allow v2 to be used.

I gave it a go, but the CI build on ruby versions that installed Ascii85 v2 failed, for example: https://buildkite.com/yob-opensource/pdf-reader/builds/629#0191ac89-884b-4fc1-ace3-3f1a7b11258a

The input data was pulled from a test PDF and is hard to work with for a reproduction, so I trimmed the sample down and put together a short script:

# coding: utf-8

require "bundler/inline"

gemfile do
  source "https://rubygems.org"

  #gem 'Ascii85', '1.1.1'
  gem 'Ascii85', '2.0.0'
end

require 'ascii85'

data = %Q{<~8;Xu[gMYb*&H)\\/_`Xe]6AlP#",[.!-NsinYhkAeQYmmYS$ojXOU~>}
puts data

puts "*****************************"
puts "input utf8"
puts "*****************************"

puts data.encoding
puts data.valid_encoding?

res = Ascii85.decode(data)
puts res.inspect

If I flip the Ascii85 version between v1 and v2, the input data works on v1.1.1 and raises an exception on v2.0.0:

$ ruby repro.rb 
<~8;Xu[gMYb*&H)\/_`Xe]6AlP#",[.!-NsinYhkAeQYmmYS$ojXOU~>
*****************************
input utf8
*****************************
UTF-8
true
/home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:345:in `write': incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError)
        from /home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:298:in `decode_raw'
        from /home/jh/.rbenv/versions/3.3.3/lib/ruby/gems/3.3.0/gems/Ascii85-2.0.0/lib/ascii85.rb:192:in `decode'
        from repro.rb:24:in `<main>'

The output data is expected to be binary and not valid UTF-8. I assume I might be able to work around it by using the new v2 API to pass in a binary-encoded output buffer. However, pdf-reader still supports rubies < 2.7, so I'm aiming to use the v1-compatible parts of Ascii85's API.

yob avatar Sep 01 '24 11:09 yob

Thank you for such an outstanding bug report! The script made it really easy to reproduce the problem.

I was too confident that my specs would protect me from this kind of issue; it seems I have not tested binary data sufficiently.

I managed to distill your example down into a short string that triggers the issue, <~S$ojXOT~> (OU and OT are equivalent because the last bits get chopped off, but the gem produces OT when encoding the data), but alas I have now run out of time for today.
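The OU/OT equivalence follows from the usual Ascii85 rule for a short final group: pad it to five characters with `u`, decode it as a 32-bit big-endian value, and keep one byte fewer than the number of characters supplied. Below is a minimal sketch of that rule using only the standard library. `decode_final_group` is a hypothetical helper written for illustration and is not part of the gem's API.

```ruby
# Hypothetical helper: decodes a short (2 to 4 character) final Ascii85 group.
# This assumes the standard padding rule and is not taken from the gem's source.
def decode_final_group(group)
  raise ArgumentError, "expected 2 to 4 characters" unless (2..4).cover?(group.length)

  # Pad with 'u' (the highest Ascii85 digit) up to a full five-character group.
  padded = group.ljust(5, "u")

  # Interpret the five characters as base-85 digits, where '!' (byte 33) is zero.
  value = padded.each_byte.reduce(0) { |acc, byte| acc * 85 + (byte - 33) }

  # Pack as a 32-bit big-endian integer and keep only (group.length - 1) bytes.
  [value].pack("N").byteslice(0, group.length - 1)
end

ot = decode_final_group("OT")
ou = decode_final_group("OU")

puts ot.bytes.inspect # prints [145]
puts ou.bytes.inspect # prints [145]
puts ot == ou         # prints true
```

The second character only influences bits that are discarded when the group is truncated to a single byte, which is why both spellings decode to the same data.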

I think the issue can mostly be solved by spamming force_encoding(Encoding::ASCII_8BIT) throughout the code, to make sure that the gem always uses the BINARY encoding instead of the default UTF-8, but that is a rather ugly solution.
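For readers unfamiliar with this class of error, here is a small standalone illustration using only core Ruby. It is not the gem's code; the variable names are made up for the example. It shows that joining a non-ASCII UTF-8 string with a binary string raises `Encoding::CompatibilityError`, and that retagging the UTF-8 side with `force_encoding` avoids it.

```ruby
# A UTF-8 string containing a non-ASCII character (bytes C3 A9).
utf8_chunk = "\u00e9"

# A binary (ASCII-8BIT) string containing a byte that is not valid UTF-8.
binary_chunk = "\xFF".b

# Joining the two raises, because neither side is ASCII-only and the
# encodings differ.
error_raised = begin
  utf8_chunk + binary_chunk
  false
rescue Encoding::CompatibilityError
  true
end

# Retagging the UTF-8 side as binary leaves the bytes untouched and makes the
# two strings compatible.
joined = utf8_chunk.dup.force_encoding(Encoding::ASCII_8BIT) + binary_chunk

puts error_raised
puts joined.encoding
puts joined.bytes.inspect
```

`force_encoding` only changes the encoding tag and never transcodes, so the byte content is preserved.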

Still, I will shortly push a commit doing just that, and it at least makes the example pass -- but I'm not 100% sure if I managed to catch every instance of the problem.

I probably won't be able to work on this again before Wednesday; I'll try to see if I can uncover more edge cases that can lead to problems then.

DataWraith avatar Sep 02 '24 19:09 DataWraith

Sounds good. There's no urgency from my perspective, released versions of pdf-reader are locked to v1.x so they're continuing to work fine.

I can see a fix has been pushed to main, so I gave it a go (https://github.com/yob/pdf-reader/compare/ascii85-2-0?expand=1). The pdf-reader spec suite is green (some jobs failed, but for unrelated reasons). Here's a passing example on ruby 3.3: https://buildkite.com/yob-opensource/pdf-reader/builds/630#0191b783-766d-4fa6-b7e4-b9583a832f1e

yob avatar Sep 03 '24 10:09 yob

Thank you for testing the changes!

I went through the code again on the weekend and made sure that all String literals are unfrozen and encoded as ASCII_8BIT before use; that should take care of the encoding errors.
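As a rough sketch of what "unfrozen and encoded as ASCII_8BIT" can look like in plain Ruby (an illustration only, not the code that went into the gem), a buffer created with `String.new(encoding: ...)` is mutable and binary from the start, so raw bytes can be appended to it safely:

```ruby
# A frozen string cannot be used as an output buffer.
frozen_buffer = "".freeze
append_failed = begin
  frozen_buffer << "x"
  false
rescue FrozenError
  true
end

# String.new returns an unfrozen string; the encoding keyword tags it as binary.
buffer = String.new(encoding: Encoding::ASCII_8BIT)

# Appending a raw high byte and then an ASCII-only string both work, and the
# buffer keeps its binary encoding.
buffer << [0x91].pack("C")
buffer << "abc"

puts append_failed
puts buffer.frozen?
puts buffer.encoding
puts buffer.bytes.inspect
```

Starting from a binary buffer means later appends of decoded bytes never have to be reconciled with a UTF-8 tag.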

The new version has also managed to correctly encode and then decode a few gigabytes of random binary data without raising an Exception, so I hope that it works properly now.

Unless something else crops up, I'll probably release version 2.0.1 this weekend.

DataWraith avatar Sep 11 '24 10:09 DataWraith

Thanks for your help here! I've released pdf-reader with a relaxed Ascii85 dependency and all our tests are green :heart:

yob avatar Nov 02 '24 00:11 yob