Always read data/list.txt as UTF-8 to avoid "ArgumentError: invalid byte sequence in US-ASCII" when parsing it

Open dentarg opened this issue 9 years ago • 13 comments

If your environment fails to specify UTF-8, Ruby defaults to US-ASCII, and when public_suffix tries to parse the list data, it fails:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
ArgumentError: invalid byte sequence in US-ASCII
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `strip!'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:89:in `block (2 levels) in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `each_line'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:88:in `block in parse'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:128:in `initialize'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `new'
    from /Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.2/lib/public_suffix/list.rb:87:in `parse'
    from (irb):1
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):002:0> Encoding.default_external
=> #<Encoding:US-ASCII>
irb(main):003:0> RUBY_VERSION
=> "2.2.5"
irb(main):004:0>

Passing encoding: Encoding::UTF_8 to File.read makes it work, even if the default encoding isn't UTF-8:

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8) ; PublicSuffix::List.parse(list_data, private_domains: false) ; nil
=> nil
irb(main):002:0> RUBY_VERSION
=> "2.2.5"
irb(main):003:0> Encoding.default_external
=> #<Encoding:US-ASCII>
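
Why the explicit encoding helps: File.read tags the returned string with Encoding.default_external unless told otherwise, while the bytes on disk are the same either way. A minimal, self-contained sketch of that difference follows. The temp file only stands in for data/list.txt, and the US-ASCII default is set by hand to mimic the empty-locale shell above; both are assumptions made for illustration.

```ruby
require "tempfile"

# Mimic a process started without a UTF-8 locale (affects this process only).
Encoding.default_external = Encoding::US_ASCII

default_read = nil
utf8_read = nil

Tempfile.create("list") do |f|
  f.binmode
  f.write("\u516C\u53F8.cn\n".b) # raw UTF-8 bytes of one IDN rule
  f.flush

  default_read = File.read(f.path)                            # tagged with default_external
  utf8_read    = File.read(f.path, encoding: Encoding::UTF_8) # tagged explicitly
end

puts default_read.encoding        # US-ASCII
puts default_read.valid_encoding? # false
puts utf8_read.encoding           # UTF-8
puts utf8_read.valid_encoding?    # true
```

Both strings hold identical bytes; only the encoding tag differs, and that tag decides whether later string operations see invalid input.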

Related to https://github.com/weppos/publicsuffix-ruby/issues/94 (maybe the list data has changed since?)

dentarg avatar Sep 19 '16 13:09 dentarg

Thanks @dentarg, I'll investigate. Are you able to tell me which line in the definition file is causing the issue?

weppos avatar Oct 15 '16 12:10 weppos

@weppos I hope this helps (I'm in a hurry now, so I haven't checked this too closely)

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix' ; list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH) ; nil
=> nil
irb(main):002:0> list_data.class
=> String
irb(main):007:0> ctr = 0 ; outside_line = "" ; list_data.each_line { |line| ctr += 1 ; outside_line = line ; line.strip! } ; nil
ArgumentError: invalid byte sequence in US-ASCII
    from (irb):7:in `strip!'
    from (irb):7:in `block in irb_binding'
    from (irb):7:in `each_line'
    from (irb):7
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):008:0> ctr
=> 610
irb(main):009:0> outside_line
=> "\xE5\x85\xAC\xE5\x8F\xB8.cn\n"
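
Another way to locate the offending line, without relying on an exception, is to ask each line whether its bytes are valid in its tagged encoding. This matters because the method that raises seems to depend on the Ruby version (strip! in the 2.2.5 transcripts of this thread, count in the 2.4.10 ones). The sketch below uses made-up sample data in place of data/list.txt, and first_invalid_line is a hypothetical helper name.

```ruby
# Sample data standing in for data/list.txt: the bytes are UTF-8, but the
# string is tagged US-ASCII, which is what File.read produces when
# Encoding.default_external is US-ASCII.
list_data = "// sample\ncom.cn\n\u516C\u53F8.cn\n\u7F51\u7EDC.cn\n"
            .dup.force_encoding(Encoding::US_ASCII)

# Returns [line_number, line] for the first line whose bytes are not valid
# in the string's tagged encoding, or nil if every line is valid.
def first_invalid_line(data)
  data.each_line.with_index(1) do |line, number|
    return [number, line] unless line.valid_encoding?
  end
  nil
end

number, line = first_invalid_line(list_data)
puts number  # 3
p line.bytes # the raw bytes, independent of any encoding tag

# Retagging the same bytes as UTF-8 makes the line valid again.
decoded = line.dup.force_encoding(Encoding::UTF_8)
puts decoded.valid_encoding? # true
```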

dentarg avatar Oct 16 '16 17:10 dentarg

This was with 2.0.3:

irb(main):010:0> PublicSuffix::List::DEFAULT_LIST_PATH
=> "/Users/dentarg/.gem/ruby/2.2.5/gems/public_suffix-2.0.3/lib/public_suffix/../../data/list.txt"

dentarg avatar Oct 16 '16 17:10 dentarg

Hmm... maybe I was naive to believe that everything would be good by File.read with encoding: Encoding::UTF_8 just because it doesn't raise any exception. Seems like "网络.cn\n" is read as "\u7F51\u7EDC.cn\n". This is on OS X 10.11.6, Ruby 2.2.5, zsh 5.0.8, public_suffix-2.0.3. I don't think I fully understand all the LANG, LANGUAGE, LC_* business.

$ LANG= LANGUAGE= LC_ALL= LC_CTYPE= irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610]
=> "\u7F51\u7EDC.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8).each_line.to_a[610].strip!
=> "\u7F51\u7EDC.cn"
irb(main):004:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "\xE7\xBD\x91\xE7\xBB\x9C.cn\n"
irb(main):005:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
ArgumentError: invalid byte sequence in US-ASCII
    from (irb):5:in `strip!'
    from (irb):5
    from /Users/dentarg/.rubies/ruby-2.2.5/bin/irb:11:in `<main>'
irb(main):006:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["", "", "", ""]
$ irb
irb(main):001:0> require 'public_suffix'
=> true
irb(main):002:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610]
=> "网络.cn\n"
irb(main):003:0> File.read(PublicSuffix::List::DEFAULT_LIST_PATH).each_line.to_a[610].strip!
=> "网络.cn"
irb(main):004:0> %w(LANG LANGUAGE LC_ALL LC_CTYPE).map { |v| ENV[v] }
=> ["en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8", "en_US.UTF-8"]
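
On the "\u7F51\u7EDC.cn\n" output: as far as I can tell this is only how irb displays the string, not a sign that it was read incorrectly. String#inspect appears to escape characters that cannot be represented in the default external encoding, so a correctly decoded UTF-8 string shows up with \u escapes when that default is US-ASCII. A small sketch, assuming default_internal is unset:

```ruby
# Mimic the empty-locale irb session (affects this process only).
Encoding.default_external = Encoding::US_ASCII

line = "\u7F51\u7EDC.cn\n" # the same rule, written with escapes

puts line.encoding        # UTF-8
puts line.valid_encoding? # true
puts line.strip.length    # 5 (two CJK characters plus ".cn")
puts line.bytesize        # 10 (3 + 3 bytes for the CJK characters, then ".cn\n")

# Display only: the escaped form depends on the default encodings,
# while the string's contents do not.
puts line.inspect
```

The string stays valid UTF-8 and strip works on it; only its printed representation changes with the environment.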

dentarg avatar Oct 16 '16 22:10 dentarg

I'm having this problem with version 3.0.3

tamoyal avatar Sep 08 '18 16:09 tamoyal

Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos

SeanDunford avatar Apr 03 '19 21:04 SeanDunford

Bump. Is this project dead? Does anyone have a fork or alternate project where this is working? @weppos

It is not dead. If your operating environment is set to a correct UTF-8 locale, the library will work perfectly.

weppos avatar Apr 04 '19 08:04 weppos

FWIW, it would seem more correct if the gem did not depend on any particular environment setup (i.e. was agnostic to it) for nominal operation.

aleksandrs-ledovskis avatar Apr 04 '19 11:04 aleksandrs-ledovskis

@SeanDunford @aleksandrs-ledovskis feel free to provide a patch and I will review it. So far, the only one who provided practical help was @dentarg, but even he admitted the problem may not be that easy to solve.

Frankly, I am reluctant to put any effort into trying to make UTF-8 work because the real solution is to pre-process the list and have it stored in Punycode as this is how names should be managed and compared.

It's just not at the top of my priorities right now. PRs are always welcome.
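
To illustrate what storing the list in Punycode would mean, here is a hand-rolled sketch of the RFC 3492 encoder applied label by label. It is not part of public_suffix; PunycodeSketch and its method names are hypothetical, and the sketch skips the normalization and mapping steps that full IDNA processing requires.

```ruby
# Illustrative Punycode encoder following RFC 3492. Not part of public_suffix.
module PunycodeSketch
  BASE = 36
  TMIN = 1
  TMAX = 26
  SKEW = 38
  DAMP = 700
  INITIAL_BIAS = 72
  INITIAL_N = 128

  module_function

  # Bias adaptation (RFC 3492, section 6.1).
  def adapt(delta, numpoints, first)
    delta = first ? delta / DAMP : delta / 2
    delta += delta / numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) / 2
      delta /= (BASE - TMIN)
      k += BASE
    end
    k + ((BASE - TMIN + 1) * delta) / (delta + SKEW)
  end

  # Maps 0..25 to "a".."z" and 26..35 to "0".."9".
  def digit(value)
    (value < 26 ? 97 + value : 22 + value).chr
  end

  # Encodes one label (without the "xn--" prefix).
  def encode_label(label)
    cps = label.codepoints
    out = cps.select { |c| c < INITIAL_N }.map(&:chr).join
    basic = handled = out.length
    out << "-" if basic > 0
    n = INITIAL_N
    delta = 0
    bias = INITIAL_BIAS
    while handled < cps.length
      m = cps.select { |c| c >= n }.min # next code point to insert
      delta += (m - n) * (handled + 1)
      n = m
      cps.each do |c|
        delta += 1 if c < n
        next unless c == n
        q = delta
        k = BASE
        loop do # emit delta as a variable-length integer
          t = (k - bias).clamp(TMIN, TMAX)
          break if q < t
          out << digit(t + (q - t) % (BASE - t))
          q = (q - t) / (BASE - t)
          k += BASE
        end
        out << digit(q)
        bias = adapt(delta, handled + 1, handled == basic)
        delta = 0
        handled += 1
      end
      delta += 1
      n += 1
    end
    out
  end

  # Converts each non-ASCII label of a dotted name to its "xn--" form.
  def to_ascii(name)
    name.split(".").map { |l| l.ascii_only? ? l : "xn--" + encode_label(l) }.join(".")
  end
end

puts PunycodeSketch.to_ascii("\u516C\u53F8.cn") # xn--55qx5d.cn
puts PunycodeSketch.to_ascii("\u7F51\u7EDC.cn") # xn--io0a7i.cn
```

A list stored in this form is pure ASCII, so parsing it would no longer depend on the process locale.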

weppos avatar Apr 04 '19 12:04 weppos

This is still broken in 4.0.3 on ruby:2.4-slim-buster docker image.

A workaround is setting: LANG=en_US.UTF-8 LANGUAGE=en_US.UTF-8 LC_ALL=en_US.UTF-8 before calling ruby.

alexef avatar Feb 05 '21 10:02 alexef

Looks like LANG=C.UTF-8 is enough; the Docker images for Ruby >= 2.5 set that:

$ docker run --rm ruby:2.4-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=2ea0e1a03e36
RUBY_MAJOR=2.4
RUBY_VERSION=2.4.10
RUBY_DOWNLOAD_SHA256=d5668ed11544db034f70aec37d11e157538d639ed0d0a968e2f587191fc530df
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root

vs

$ docker run --rm ruby:2.5-slim-buster env
PATH=/usr/local/bundle/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
HOSTNAME=7d11ed52a0af
LANG=C.UTF-8
RUBY_MAJOR=2.5
RUBY_VERSION=2.5.8
RUBY_DOWNLOAD_SHA256=0391b2ffad3133e274469f9953ebfd0c9f7c186238968cbdeeb0651aa02a4d6d
RUBYGEMS_VERSION=3.0.3
GEM_HOME=/usr/local/bundle
BUNDLE_SILENCE_ROOT_WARNING=1
BUNDLE_APP_CONFIG=/usr/local/bundle
HOME=/root

Running my initial example

# publicsuffix.rb
require 'bundler/inline'
gemfile do
  source 'https://rubygems.org'
  gem 'public_suffix'
end
puts RUBY_VERSION
puts PublicSuffix::List::DEFAULT_LIST_PATH
list_data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH)
PublicSuffix::List.parse(list_data, private_domains: false)

In ruby:2.4-slim-buster

$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.4-slim-buster bash
root@aa7eb67dce29:/app# gem install bundler
Fetching bundler-2.2.8.gem
Successfully installed bundler-2.2.8
1 gem installed
root@aa7eb67dce29:/app# ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
	from publicsuffix.rb:9:in `<main>'
root@aa7eb67dce29:/app# LANG=C.UTF-8 ruby publicsuffix.rb
2.4.10
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt

In ruby:2.5-slim-buster

$ docker run --rm -it -v $(pwd):/app -w /app ruby:2.5-slim-buster bash
root@b87a1b578bbf:/app# ruby publicsuffix.rb
2.5.8
/usr/local/bundle/gems/public_suffix-4.0.6/data/list.txt

The problematic code in public_suffix is PublicSuffix::List.default

https://github.com/weppos/publicsuffix-ruby/blob/c4c301231549f98b53bd987c9398b3a366aad815/lib/public_suffix/list.rb#L44-L52

$ docker run --rm -it ruby:2.4-slim-buster bash
root@31cd6631fcaa:/# gem install public_suffix
Fetching public_suffix-4.0.6.gem
Successfully installed public_suffix-4.0.6
1 gem installed
root@31cd6631fcaa:/# ruby -rpublic_suffix -e 'PublicSuffix::List.default'
/usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `count': invalid byte sequence in US-ASCII (ArgumentError)
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:128:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:119:in `build'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/rule.rb:334:in `factory'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:94:in `block (2 levels) in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `each_line'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:75:in `block in parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:108:in `initialize'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `new'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:74:in `parse'
	from /usr/local/bundle/gems/public_suffix-4.0.6/lib/public_suffix/list.rb:51:in `default'
	from -e:1:in `<main>'
root@31cd6631fcaa:/# LANG=C.UTF-8 ruby -rpublic_suffix -e 'PublicSuffix::List.default'
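
Until the gem reads the file as UTF-8 itself, an application can sidestep PublicSuffix::List.default's plain File.read by building the list once with an explicit encoding and installing it as the default. This is a sketch under assumptions: it presumes the gem is installed and that List.default= behaves as in the 4.0.x series discussed here; use_utf8_default_list is a hypothetical helper name.

```ruby
# Builds the default rule list from a UTF-8 read of the bundled data file,
# regardless of Encoding.default_external. Returns false if the gem is missing.
def use_utf8_default_list
  require "public_suffix"
  data = File.read(PublicSuffix::List::DEFAULT_LIST_PATH, encoding: Encoding::UTF_8)
  PublicSuffix::List.default = PublicSuffix::List.parse(data)
  true
rescue LoadError
  false
end

if use_utf8_default_list
  puts PublicSuffix.domain("www.example.com")
else
  warn "public_suffix is not installed; nothing to patch"
end
```

Calling the helper once at boot, before any PublicSuffix.parse or PublicSuffix.domain call, should be enough, since those calls use the default list unless another one is passed.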

dentarg avatar Feb 05 '21 12:02 dentarg

I'm encountering an error that is probably related to this:

domain = PublicSuffix.domain(request.host)
Tenant.find_by!(domain: domain)

Raises: ArgumentError (Cannot transliterate strings with ASCII-8BIT encoding)

Forcing UTF-8 works:

domain = PublicSuffix.domain(host).to_s.force_encoding('UTF-8')

Ruby: 3.0.0 Rails: 6.1.3 Gem: 4.0.6

zavan avatar Feb 19 '21 14:02 zavan

Two workarounds below.

  1. Set the encoding using the Ruby interpreter's -E flag:
ruby -E utf-8 ./foo.rb
  2. Set the external encoding programmatically:
require 'public_suffix'

Encoding.default_external = 'utf-8'
puts PublicSuffix.parse('example.com').inspect

mcarpenter avatar May 03 '24 10:05 mcarpenter