Error when reading csv in non UTF-8 encoding

Open etibarg opened this issue 2 years ago • 3 comments

Hello,

According to https://docs.juliahub.com/CSV/HHBkp/0.10.4/examples.html#stringencodings, a CSV file in a non-UTF-8 encoding should be read with the help of StringEncodings. The example code provided is:

file = CSV.File(open("iso8859_encoded_file.csv", enc"ISO-8859-1"))

However, I can't make it work as suggested:

julia> file = CSV.File(open("EBNM_cropped_im\\output_detection_EBNM.csv", enc"UCS-2LE"))
ERROR: MethodError: no method matching readavailable(::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream})
Closest candidates are:
  readavailable(::Base.AbstractPipe) at io.jl:427
  readavailable(::Base.GenericIOBuffer) at iobuffer.jl:467
  readavailable(::Base.LibuvStream) at stream.jl:983
  ...
Stacktrace:
 [1] write(to::TranscodingStreams.NoopStream{IOStream}, from::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream})
   @ Base .\io.jl:753
 [2] buffer_to_tempfile
   @ C:\Users\XXX\.julia\packages\CSV\jFiCn\src\utils.jl:311 [inlined]
 [3] getbytebuffer(x::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream}, buffer_in_memory::Bool)
   @ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\utils.jl:268
 [4] getsource(x::Any, buffer_in_memory::Bool)
   @ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\utils.jl:288
 [5] CSV.Context(source::CSV.Arg, header::CSV.Arg, normalizenames::CSV.Arg, datarow::CSV.Arg, skipto::CSV.Arg, footerskip::CSV.Arg, transpose::CSV.Arg, comment::CSV.Arg, ignoreemptyrows::CSV.Arg, ignoreemptylines::CSV.Arg, select::CSV.Arg, drop::CSV.Arg, limit::CSV.Arg, buffer_in_memory::CSV.Arg, threaded::CSV.Arg, ntasks::CSV.Arg, tasks::CSV.Arg, rows_to_check::CSV.Arg, lines_to_check::CSV.Arg, missingstrings::CSV.Arg, missingstring::CSV.Arg, delim::CSV.Arg, ignorerepeated::CSV.Arg, quoted::CSV.Arg, quotechar::CSV.Arg, openquotechar::CSV.Arg, closequotechar::CSV.Arg, escapechar::CSV.Arg, dateformat::CSV.Arg, dateformats::CSV.Arg, decimal::CSV.Arg, truestrings::CSV.Arg, falsestrings::CSV.Arg, stripwhitespace::CSV.Arg, type::CSV.Arg, types::CSV.Arg, typemap::CSV.Arg, pool::CSV.Arg, downcast::CSV.Arg, lazystrings::CSV.Arg, stringtype::CSV.Arg, strict::CSV.Arg, silencewarnings::CSV.Arg, maxwarnings::CSV.Arg, debug::CSV.Arg, parsingdebug::CSV.Arg, validate::CSV.Arg, streaming::CSV.Arg)
   @ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\context.jl:304
 [6] #File#25
   @ C:\Users\XXX\.julia\packages\CSV\jFiCn\src\file.jl:221 [inlined]
 [7] CSV.File(source::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream})
   @ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\file.jl:162
 [8] top-level scope
   @ REPL[14]:1

Notes:

  • According to Notepad++, the file's encoding is "UTF-16 LE BOM" / "UCS-2 LE".
  • When I convert the file to "UTF-8" in Notepad++, I can read it.

Versioninfo:

julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab7 (2022-08-17 13:38 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
  Threads: 8 on 16 virtual cores
Environment:
  JULIA_PKG_DEVDIR = C:/Users/XXX/Devdir
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8

etibarg avatar Sep 02 '22 13:09 etibarg

Bummer; it seems like the StringDecoder object doesn't implement the expected IO interface we're relying on internally.

quinnj avatar Sep 02 '22 13:09 quinnj

A workaround would be file = CSV.File(read(open("iso8859_encoded_file.csv", enc"ISO-8859-1")))

quinnj avatar Sep 02 '22 13:09 quinnj

It works, thanks!

etibarg avatar Sep 02 '22 13:09 etibarg

I had the same problem.

Maybe update the examples?

The CSV.File docstring already suggests a similar workaround:

https://csv.juliadata.org/stable/reading.html#CSV.File

For text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, input, enc"ISO-8859-1")).
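
For the UTF-16 LE file reported above, the same idea (decode the whole input to UTF-8 first, then hand the result to the parser) can also be sketched with Base alone. This is a minimal sketch, not the method recommended in the docstring: `read_utf16le` is a hypothetical helper name, the sample data is made up, and the final hand-off to CSV.File is only described in a comment because it assumes CSV.jl is loaded.

```julia
# Hypothetical stand-in for the real file: a tiny UTF-16 LE CSV with a BOM.
path = tempname()
text = "name,value\nété,1\n"
open(path, "w") do io
    units = vcat(UInt16[0xFEFF], transcode(UInt16, text))
    write(io, htol.(units))  # store each code unit in little-endian order
end

# `read_utf16le` is a hypothetical helper, not part of CSV.jl or StringEncodings.jl.
function read_utf16le(path::AbstractString)
    bytes = read(path)                         # raw bytes of the whole file
    units = ltoh.(reinterpret(UInt16, bytes))  # little-endian bytes -> UInt16 code units
    if !isempty(units) && units[1] == 0xFEFF   # drop the byte order mark if present
        units = units[2:end]
    end
    return transcode(String, units)            # UTF-16 -> UTF-8 String
end

decoded = read_utf16le(path)
# Assumption: with CSV.jl loaded, the decoded text could then be parsed
# with CSV.File(IOBuffer(decoded)).
```

Like the read-based workaround, this holds the whole decoded file in memory, so it suits small to medium files rather than streaming use.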

cirocavani avatar Oct 31 '22 17:10 cirocavani

I've implemented a fix at https://github.com/JuliaStrings/StringEncodings.jl/pull/53.

Though @quinnj, do you think CSV.jl could use a more efficient approach? My implementation of readavailable returns copies of the data (in batches of 200 bytes), because the function is documented to return a Vector{UInt8} (otherwise I would return a SubArray). But that copy is only used by the write(::IO, ::IO) fallback to write its contents to another stream, so it's really wasteful. By contrast, StringDecoder has an efficient implementation of readbytes!, which isn't used here. Do you know a better API that buffer_to_tempfile could use? Or maybe this is a problem in Julia and the write(::IO, ::IO) fallback should be improved, or readavailable could even be changed to support returning views?
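
To make the trade-off concrete, here is a rough sketch of a copy loop built on readbytes! with one reusable buffer, in place of the readavailable-based fallback. It uses only Base functions; `copy_stream!` and its `bufsize` keyword are hypothetical names for illustration and are not taken from CSV.jl or StringEncodings.jl, whose actual internals may differ.

```julia
# Copy everything from `src` to `dst` through a single reusable buffer.
# `copy_stream!` is a hypothetical helper, not an API of CSV.jl or StringEncodings.jl.
function copy_stream!(dst::IO, src::IO; bufsize::Integer = 8192)
    buf = Vector{UInt8}(undef, bufsize)
    total = 0
    while !eof(src)
        n = readbytes!(src, buf, bufsize)    # fills `buf` in place, returns bytes read
        n == 0 && break
        total += write(dst, view(buf, 1:n))  # write only the filled part, without copying
    end
    return total
end

# Usage with in-memory streams standing in for a decoder and a temp file.
src = IOBuffer("a,b\n1,2\n")
dst = IOBuffer()
copied = copy_stream!(dst, src)
result = String(take!(dst))
```

The buffer is allocated once and reused on every iteration, so no per-batch Vector{UInt8} is created the way it is when each readavailable call returns a fresh copy.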

nalimilan avatar Jan 07 '23 23:01 nalimilan

I'm releasing a new StringEncodings version to fix this, but it would still be interesting to check whether a more efficient solution could be implemented @quinnj.

nalimilan avatar Jan 24 '23 10:01 nalimilan

Bump @quinnj.

nalimilan avatar Jun 29 '23 12:06 nalimilan

Ah, sorry for the slow response. Yeah, @Drvi and I have been jamming on some bigger-picture refactorings that will eventually make their way here, and yes, in the new model, we're using readbytes! instead.

quinnj avatar Jun 30 '23 01:06 quinnj