chardet2

I got ArgumentError: invalid byte sequence in UTF-8 in Ruby 1.9.3 while trying to detect the encoding of an 'ISO-8859-1' encoded CSV file.

Open orbanbotond opened this issue 11 years ago • 5 comments

ArgumentError: invalid byte sequence in UTF-8
from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:134:in `=~'
from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:134:in `feed'
from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:46:in `chardet'

orbanbotond avatar May 24 '13 10:05 orbanbotond
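
For readers hitting the same trace: Ruby raises this kind of error when a regexp is matched against a string that is tagged as UTF-8 but holds bytes that are not valid UTF-8, which is what ISO-8859-1 text usually looks like. The sketch below is only an illustration of that mechanism under assumed sample data. The byte values and variable names are made up, and it does not claim to show what line 134 of UniversalDetector.rb actually does.

```ruby
# Illustration only: the sample bytes and variable names are assumptions.
# "caf\xE9" is "café" in ISO-8859-1; a lone 0xE9 byte is not valid UTF-8.
latin1_bytes = "caf\xE9"

# The same bytes tagged as UTF-8, as happens when a file is read in text
# mode under a UTF-8 default external encoding.
tagged = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
puts tagged.valid_encoding?   # false

# Matching a regexp against the invalid string is where the error surfaces.
match_error =
  begin
    tagged =~ /[a-z]/
    nil
  rescue ArgumentError => e
    e.message
  end
puts match_error.inspect      # the ArgumentError message, or nil if none was raised

# Tagging the same bytes as binary makes every byte valid, so a byte-level
# regexp (note the /n flag) can scan for high-bit bytes safely.
binary = latin1_bytes.dup.force_encoding(Encoding::BINARY)
high_bit_index = binary =~ /[\x80-\xFF]/n
puts high_bit_index           # 3
```

If the data was read without binary mode, calling force_encoding(Encoding::BINARY) on it before handing it to the detector may be enough to avoid the error. That is a guess from the trace, not something confirmed in this thread.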

Can you attach the csv file?

janx avatar May 24 '13 10:05 janx

The file is huge, 35 MB. I will try to make it smaller.

orbanbotond avatar May 24 '13 10:05 orbanbotond

@orbanbotond I cannot reproduce this on my ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-linux]. Here's my test script:

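require 'UniversalDetector'

# Read the raw bytes in binary mode so the data is not tagged as UTF-8;
# File.binread also closes the file handle once the read is done.
data = File.binread('Insight_Extract_11-04-2013-a.csv')
p UniversalDetector.chardet(data)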

The output is {"encoding"=>"ISO-8859-2", "confidence"=>0.7616471388020385}.

janx avatar May 25 '13 03:05 janx

Well... with such a huge file it ran forever, and I never got a result.

How long did it take you to get the detection result?

orbanbotond avatar May 28 '13 06:05 orbanbotond

I can't remember the exact number, 5-10 mins I guess.

janx avatar May 28 '13 06:05 janx
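
Since detection over the whole 35 MB file takes minutes, one possible workaround is to feed the detector only a leading sample of the file. The sketch below is a rough idea under stated assumptions: read_sample is a hypothetical helper, not part of chardet2, the sample sizes are arbitrary, and the temp file only stands in for the real CSV. A sample can give a different answer than the full file, for example when the first non-ASCII bytes appear late.

```ruby
require 'tempfile'

# Hypothetical helper (not part of chardet2): read at most `limit` raw bytes
# from the start of the file, in binary mode.
def read_sample(path, limit = 64 * 1024)
  File.binread(path, limit)
end

# Stand-in for a large CSV: ISO-8859-1 bytes written to a temp file.
csv = Tempfile.new(['sample', '.csv'])
csv.binmode
csv.write("name;city\n" + "Andr\xE9;K\xF6ln\n".b * 10_000)
csv.close

sample = read_sample(csv.path, 4096)
puts sample.bytesize                      # 4096
puts sample.encoding == Encoding::BINARY  # true

# With the gem installed, the sample could then be passed to the detector
# in the same way as the script earlier in this thread:
#   require 'UniversalDetector'
#   p UniversalDetector.chardet(sample)

csv.unlink
```

Because the sample is read with File.binread, it is tagged as binary rather than UTF-8.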