publicsuffix-ruby
Always read data/list.txt as UTF-8 to avoid "ArgumentError: invalid byte sequence in US-ASCII" when parsing it
If your environment fails to specify UTF-8, Ruby defaults to US-ASCII, and when public_suffix tries to parse the list data, it fails:
$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
ArgumentError: invalid byte sequence in US-ASCII
from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `strip!'
from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `block (2 levels) in parse'
from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `each_line'
from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `block in parse'
from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:128:in `initialize'
from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `new'
from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `parse'
from (irb):1
from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):002:0> Encoding.default_external
=> #<Encoding:US-ASCII>
irb(main):003:0> RUBY_VERSION
=> "2.2.5"
irb(main):004:0>
Passing encoding: Encoding::UTF_8 to File.read makes it work, even if the default encoding isn't UTF-8:
$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
=> nil
irb(main):002:0> RUBY_VERSION
=> "2.2.5"
irb(main):003:0> Encoding.default_external
=> #<Encoding:US-ASCII>
Related to https://github.com/weppos/publicsuffix-ruby/issues/94 (maybe the list data has changed since?)
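The difference between the two reads can be reproduced without the gem. This is a minimal sketch, assuming a hypothetical two-line sample in place of data/list.txt, and passing encoding: Encoding::US_ASCII to stand in for a US-ASCII default external encoding:

```ruby
require "tempfile"

# Hypothetical stand-in for data/list.txt: one ASCII rule, one non-ASCII rule.
sample = "com\n\u516C\u53F8.cn\n"

ascii_tagged, utf8_tagged = Tempfile.create("list") do |file|
  # Write the raw bytes so no transcoding happens on the way out.
  File.binwrite(file.path, sample)

  [
    # Same bytes, tagged the way a US-ASCII default would tag them.
    File.read(file.path, encoding: Encoding::US_ASCII),
    # Same bytes, tagged explicitly as UTF-8.
    File.read(file.path, encoding: Encoding::UTF_8)
  ]
end

puts ascii_tagged.valid_encoding? # => false
puts utf8_tagged.valid_encoding?  # => true

# Only the UTF-8 tagged string is safe to hand to string methods such as strip.
puts utf8_tagged.each_line.map(&:strip).length
```

The bytes on disk are identical in both cases; only the encoding tag on the resulting String differs, which is what the later string operations trip over.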
Thanks @dentarg, I'll investigate. Are you able to tell me which line in the definition file is causing the issue?
@weppos I hope this helps (I'm in a hurry now, so I haven't checked this too closely)
$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; nil
=> nil
irb(main):002:0> list_data.class
=> String
irb(main):007:0> ctr = 0 ; outside_line = "" ; list_data.each_line { |line| ctr += 1 ; outside_line = line ; line.strip! } ; nil
ArgumentError: invalid byte sequence in US-ASCII
from (irb):7:in `strip!'
from (irb):7:in `block in irb_binding'
from (irb):7:in `each_line'
from (irb):7
from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):008:0> ctr
=> 610
irb(main):009:0> outside_line
=> "\xE5\x85\xAC\xE5\x8F\xB8.cn\n"
This was with 2.0.3:
irb(main):010:0> PublicSuffix::List::DEFAULT_LIST_PATH
=> "/Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.3/lib/public_suffix/../../data/list.txt"
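A slightly tidier way to locate the first offending line is to check valid_encoding? per line instead of waiting for strip! to raise. This is only a sketch: first_invalid_line is a hypothetical helper, and the sample string stands in for the real list data.

```ruby
# Hypothetical helper: returns [line_number, line] for the first line whose
# bytes are not valid in the string's current encoding, or nil if all are fine.
def first_invalid_line(data)
  data.each_line.with_index(1) do |line, number|
    return [number, line] unless line.valid_encoding?
  end
  nil
end

# Sample data tagged as US-ASCII, mimicking File.read under a US-ASCII default.
sample = "com\nnet\n\u516C\u53F8.cn\n".b.force_encoding(Encoding::US_ASCII)

number, line = first_invalid_line(sample)
puts number       # => 3
puts line.bytesize
```

With the real list, `first_invalid_line(File.read(PublicSuffix::List::DEFAULT_LIST_PATH))` should report the same kind of line as the counter above, assuming the gem is installed.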
Hmm... maybe I was naive to believe that everything would be good by File.read with encoding: Encoding::UTF_8 just because it doesn't raise any exception. Seems like "网络.cn\n" is read as "\u7F51\u7EDC.cn\n". This is on OS X 10.11.6, Ruby 2.2.5, zsh 5.0.8, public_suffix-2.0.3. I don't think I fully understand all the LANG, LANGUAGE, LC_* business.
$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610]
=> "\u7F51\u7EDC.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610].strip!
=> "\u7F51\u7EDC.cn"
irb(main):004:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "\xE7\xBD\x91\xE7\xBB\x9C.cn\n"
irb(main):005:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
ArgumentError: invalid byte sequence in US-ASCII
from (irb):5:in `strip!'
from (irb):5
from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):006:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["", "", "", ""]
$ irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "网络.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
=> "网络.cn"
irb(main):004:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8"]
I'm having this problem with version 3.0.3
Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos
It is not dead. If your operating environment is set to a UTF-8 locale, the library will work perfectly.
FWIW, it would seem more correct for the gem not to depend on any particular environment setup for normal operation.
@SeanDunford @aleksandrs-ledovskis feel free to provide a patch and I will review it. So far, the only one who provided practical help was @dentarg, but even he admitted the problem may not be that easy to solve.
Frankly, I am reluctant to put any effort into trying to make UTF-8 work because the real solution is to pre-process the list and have it stored in Punycode as this is how names should be managed and compared.
It's just not at the top of my priorities right now. PRs are always welcome.
This is still broken in 4.0.3 on ruby:2.4-slim-buster docker image.
A workaround is setting: LANG=en_US.UTF-8 LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 before calling ruby.
Looks like LANG=C.UTF-8 is enough; the Docker images for Ruby >= 2.5 set that:
$ docker run --rm ruby:2.4-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=2ea0e1a03e36
RUBY_MAJOR=2.4
RUBY_VERSION=2.4.10
RUBY_DOWNLOAD_SHA256=d5668ed11544db034f70aec37d11e157538d639ed0d0a968e2f587191fc530df
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root
vs
$ docker run --rm ruby:2.5-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=7d11ed52a0af
LANG=C.UTF-8
RUBY_MAJOR=2.5
RUBY_VERSION=2.5.8
RUBY_DOWNLOAD_SHA256=0391b2ffad3133e274469f9953ebfd0c9f7c186238968cbdeeb0651aa02a4d6d
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root
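To see how LANG alone affects Ruby in a given image, the default external encoding can be queried from a child process. This is a rough sketch: default_external_with is a hypothetical helper, the result depends on which locales the system actually has installed, and a -E option in RUBYOPT would override it.

```ruby
require "rbconfig"

# Hypothetical helper: starts a child Ruby with only LANG set (the other
# locale variables are removed) and returns its Encoding.default_external name.
def default_external_with(lang)
  env = { "LANG" => lang, "LANGUAGE" => nil, "LC_ALL" => nil, "LC_CTYPE" => nil }
  command = [RbConfig.ruby, "-e", "print Encoding.default_external.name"]
  IO.popen(env, command, &:read)
end

["", "C.UTF-8"].each do |lang|
  puts "LANG=#{lang.inspect} -> #{default_external_with(lang)}"
end
```

On the images above I would expect the empty LANG to report US-ASCII and C.UTF-8 to report UTF-8, but I have not verified that on every platform.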
Running my initial example
# publicsuffix.rb
require 'bundler/inline'
gemfile do
source 'https://rubygems.org'
gem 'public_suffix'
end
puts RUBY_VERSION
puts PublicSuffix::List::DEFAULT_LIST_PATH
list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH)
PublicSuffix::List.parse(list_data, private_domains: false)
In ruby:2.4-slim-buster
$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.4-slim-buster bash
root@aa7eb67dce29:/app# gem install bundler
Fetching bundler-2.2.8.gem
Successfully installed bundler-2.2.8
1 gem installed
root@aa7eb67dce29:/app# ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
from publicsuffix.rb:9:in `<main>'
root@aa7eb67dce29:/app# LANG=C.UTF-8 ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt
In ruby:2.5-slim-buster
$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.5-slim-buster bash
root@b87a1b578bbf:/app# ruby publicsuffix.rb
2.5.8
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt
The problematic code in public_suffix is PublicSuffix::List.default
https://github.com/weppos/publicsuffix-ruby/blob/c4c301231549f98b53bd987c9398b3a366aad815/lib/public_suffix/list.rb#L44-L52
$ docker run --rm -it ruby:2.4-slim-buster bash
root@31cd6631fcaa:/# gem install public_suffix
Fetching public_suffix-4.0.6.gem
Successfully installed public_suffix-4.0.6
1 gem installed
root@31cd6631fcaa:/# ruby -rpublic_suffix -e 'PublicSuffix::List.default'
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:51:in `default'
from -e:1:in `<main>'
root@31cd6631fcaa:/# LANG=C.UTF-8 ruby -rpublic_suffix -e 'PublicSuffix::List.default'
I'm encountering an error that is probably related to this:
domain = PublicSuffix.domain(request.host)
Tenant.find_by!(domain: domain)
Raises:
ArgumentError (Cannot transliterate strings with ASCII-8BIT encoding)
Forcing UTF-8 works:
domain = PublicSuffix.domain(host).to_s.force_encoding('UTF-8')
Ruby: 3.0.0 Rails: 6.1.3 Gem: 4.0.6
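Since force_encoding only re-tags the bytes and never validates them, a slightly more defensive variant of that workaround might check the result. This is a sketch only; utf8_or_raise is a hypothetical helper and not part of public_suffix or Rails.

```ruby
# Hypothetical helper: returns a UTF-8 tagged copy of the value, raising if
# the underlying bytes are not valid UTF-8. The argument is not mutated.
def utf8_or_raise(value)
  text = value.to_s.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "not valid UTF-8: #{text.inspect}" unless text.valid_encoding?

  text
end

# A binary (ASCII-8BIT) string, like the one that triggered the error above.
binary_domain = "example.com".b

domain = utf8_or_raise(binary_domain)
puts domain.encoding        # => UTF-8
puts binary_domain.encoding # unchanged
```

In the snippet above it would be used as `utf8_or_raise(PublicSuffix.domain(host))`, assuming the same host variable.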
Two workarounds below.
- Set the encoding using the Ruby interpreter's -E flag:
ruby -E utf-8 ./foo.rb
- Set the external encoding programmatically:
require 'public_suffix'
Encoding.default_external = 'utf-8'
puts PublicSuffix.parse('example.com').inspect
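If changing the default for the whole process is too broad, the second workaround can be scoped to a block. This is a sketch under assumptions: with_default_external is a hypothetical helper, and because Encoding.default_external is process-global it is not safe to use while other threads are doing I/O.

```ruby
# Hypothetical helper: temporarily switches Encoding.default_external for the
# duration of the block and restores the previous value afterwards, even if
# the block raises.
def with_default_external(encoding)
  previous = Encoding.default_external
  Encoding.default_external = encoding
  yield
ensure
  Encoding.default_external = previous
end

before = Encoding.default_external
inside = with_default_external(Encoding::UTF_8) { Encoding.default_external }

puts inside                               # => UTF-8
puts Encoding.default_external == before  # => true
```

The first call into the gem that loads the list, for example `PublicSuffix::List.default`, would go inside the block.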