chardet2
I got ArgumentError: invalid byte sequence in UTF-8 on Ruby 1.9.3 while trying to detect the encoding of an ISO-8859-1 encoded CSV file.
ArgumentError: invalid byte sequence in UTF-8
from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:134:in `=~'
from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:134:in `feed'
from /Users/boti/.rvm/gems/ruby-1.9.3-p327@search_server/gems/chardet2-1.0.1/lib/UniversalDetector.rb:46:in `chardet'
Can you attach the csv file?
The file is huge, 35 MB. I will try to make it smaller.
@orbanbotond I cannot reproduce this on my ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-linux]. Here's my test script:
require 'UniversalDetector'
data = File.open('Insight_Extract_11-04-2013-a.csv', 'rb').read
p UniversalDetector.chardet(data)
The output is {"encoding"=>"ISO-8859-2", "confidence"=>0.7616471388020385}.
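One possible explanation for the difference (an assumption, since the original failing call was not posted) is that the test script opens the file with 'rb', so the data is tagged as binary, while a text-mode read would tag the same bytes as UTF-8. A minimal sketch of that difference, using a made-up sample string rather than the real CSV:

```ruby
# Made-up ISO-8859-1 bytes: 0xE9 and 0xEF are not valid UTF-8 on their own.
latin1_bytes = "caf\xE9,na\xEFve\n"

# What a text-mode read would give under a UTF-8 default: bytes tagged as UTF-8.
as_utf8 = latin1_bytes.dup.force_encoding(Encoding::UTF_8)
# What a binary-mode ('rb') read gives: the same bytes tagged as ASCII-8BIT.
as_binary = latin1_bytes.dup.force_encoding(Encoding::BINARY)

utf8_valid   = as_utf8.valid_encoding?    # the UTF-8 tagged copy is broken
binary_valid = as_binary.valid_encoding?  # any byte sequence is valid as binary

# A regexp match on the binary copy works and returns a byte offset.
binary_match = (as_binary =~ /,/)

# A regexp match on the broken UTF-8 copy is expected to raise ArgumentError;
# the rescue keeps this sketch running either way.
utf8_error = begin
  as_utf8 =~ /,/
  nil
rescue ArgumentError => e
  e.message
end

p utf8_valid, binary_valid, binary_match, utf8_error
```

If that is what happened, reading the file with 'rb' as in the script above should avoid the error.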
Well... with such a huge file it took forever to run, and I haven't gotten any result.
How long did it take for you to get the detection result?
I can't remember the exact number, 5-10 mins I guess.
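If a run of several minutes is too slow, one option is to feed the detector only the beginning of the file. This is a rough sketch, not something tested against this CSV: the helper name read_sample and the 64 KB default are made up for illustration, and a short sample may give a different or less confident result than the full file.

```ruby
# Hypothetical helper: read at most sample_bytes from the start of the file,
# in binary mode, so the result is tagged ASCII-8BIT like the 'rb' read above.
def read_sample(path, sample_bytes = 64 * 1024)
  File.binread(path, sample_bytes)
end

# Usage (needs the chardet2 gem, so it is left commented out here):
#   require 'UniversalDetector'
#   p UniversalDetector.chardet(read_sample('Insight_Extract_11-04-2013-a.csv'))
```

Note that the cut can land in the middle of a multi-byte character, which is another reason to treat the result on a sample as a hint only.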