CSV.jl
Error when reading csv in non UTF-8 encoding
Hello,
According to https://docs.juliahub.com/CSV/HHBkp/0.10.4/examples.html#stringencodings, reading a CSV file in a non-UTF-8 encoding is done using StringEncodings.
Example code provided:
file = CSV.File(open("iso8859_encoded_file.csv", enc"ISO-8859-1"))
However, I can't make it work as suggested:
julia> file = CSV.File(open("EBNM_cropped_im\\output_detection_EBNM.csv", enc"UCS-2LE"))
ERROR: MethodError: no method matching readavailable(::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream})
Closest candidates are:
readavailable(::Base.AbstractPipe) at io.jl:427
readavailable(::Base.GenericIOBuffer) at iobuffer.jl:467
readavailable(::Base.LibuvStream) at stream.jl:983
...
Stacktrace:
[1] write(to::TranscodingStreams.NoopStream{IOStream}, from::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream})
@ Base .\io.jl:753
[2] buffer_to_tempfile
@ C:\Users\XXX\.julia\packages\CSV\jFiCn\src\utils.jl:311 [inlined]
[3] getbytebuffer(x::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream}, buffer_in_memory::Bool)
@ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\utils.jl:268
[4] getsource(x::Any, buffer_in_memory::Bool)
@ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\utils.jl:288
[5] CSV.Context(source::CSV.Arg, header::CSV.Arg, normalizenames::CSV.Arg, datarow::CSV.Arg, skipto::CSV.Arg, footerskip::CSV.Arg, transpose::CSV.Arg, comment::CSV.Arg, ignoreemptyrows::CSV.Arg, ignoreemptylines::CSV.Arg, select::CSV.Arg, drop::CSV.Arg, limit::CSV.Arg, buffer_in_memory::CSV.Arg, threaded::CSV.Arg, ntasks::CSV.Arg, tasks::CSV.Arg, rows_to_check::CSV.Arg, lines_to_check::CSV.Arg, missingstrings::CSV.Arg, missingstring::CSV.Arg, delim::CSV.Arg, ignorerepeated::CSV.Arg, quoted::CSV.Arg, quotechar::CSV.Arg, openquotechar::CSV.Arg, closequotechar::CSV.Arg, escapechar::CSV.Arg, dateformat::CSV.Arg, dateformats::CSV.Arg, decimal::CSV.Arg, truestrings::CSV.Arg, falsestrings::CSV.Arg, stripwhitespace::CSV.Arg, type::CSV.Arg, types::CSV.Arg, typemap::CSV.Arg, pool::CSV.Arg, downcast::CSV.Arg, lazystrings::CSV.Arg, stringtype::CSV.Arg, strict::CSV.Arg, silencewarnings::CSV.Arg, maxwarnings::CSV.Arg, debug::CSV.Arg, parsingdebug::CSV.Arg, validate::CSV.Arg, streaming::CSV.Arg)
@ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\context.jl:304
[6] #File#25
@ C:\Users\XXX\.julia\packages\CSV\jFiCn\src\file.jl:221 [inlined]
[7] CSV.File(source::StringDecoder{Encoding{Symbol("UCS-2LE")}, Encoding{Symbol("UTF-8")}, IOStream})
@ CSV C:\Users\XXX\.julia\packages\CSV\jFiCn\src\file.jl:162
[8] top-level scope
@ REPL[14]:1
Notes:
- According to notepad++, encoding is "UTF-16 LE BOM" / "UCS-2 LE"
- Still in notepad++, when I change encoding to "UTF-8", I can read the file.
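The encoding reported by the editor can also be checked from Julia by looking at the byte-order mark at the start of the file. This is only a rough sketch using Base functions; `sniff_bom` is a hypothetical helper name, not part of CSV.jl or StringEncodings, and it does not try to distinguish UTF-32 (whose little-endian BOM also starts with FF FE).

```julia
# Hypothetical helper: guess the encoding from a leading byte-order mark.
# Returns a String naming the encoding, or `nothing` if no known BOM is found.
function sniff_bom(path::AbstractString)
    head = open(io -> read(io, 3), path)  # reads at most 3 bytes
    if length(head) >= 3 && head[1:3] == [0xEF, 0xBB, 0xBF]
        return "UTF-8"
    elseif length(head) >= 2 && head[1:2] == [0xFF, 0xFE]
        return "UTF-16LE"
    elseif length(head) >= 2 && head[1:2] == [0xFE, 0xFF]
        return "UTF-16BE"
    end
    return nothing
end

# Demo on a temporary file holding "a" in UTF-16 LE with a BOM.
demo_path = joinpath(mktempdir(), "bom_demo.csv")
write(demo_path, UInt8[0xFF, 0xFE, 0x61, 0x00])
println(sniff_bom(demo_path))
```

A file that Notepad++ labels "UTF-16 LE BOM" should start with the bytes FF FE, which is what the second branch checks for.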
Versioninfo:
julia> versioninfo()
Julia Version 1.8.0
Commit 5544a0fab7 (2022-08-17 13:38 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 16 × 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, tigerlake)
Threads: 8 on 16 virtual cores
Environment:
JULIA_PKG_DEVDIR = C:/Users/XXX/Devdir
JULIA_EDITOR = code
JULIA_NUM_THREADS = 8
Bummer; it seems like the StringDecoder object doesn't implement the expected IO interface we're relying on internally.
A workaround would be to do file = CSV.File(read(open("iso8859_encoded_file.csv", enc"ISO-8859-1")))
It works, thanks!
I had the same problem.
Maybe update the examples?
The CSV.File docstring already suggests a similar workaround:
https://csv.juliadata.org/stable/reading.html#CSV.File
For text encodings other than UTF-8, load the StringEncodings.jl package and call e.g. CSV.File(open(read, input, enc"ISO-8859-1")).
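For anyone landing here, the workaround can be tried end to end with something like the following. It is a minimal sketch, assuming CSV.jl and StringEncodings.jl are installed; the file name and contents are made up for illustration, and the sample file is written as raw Latin-1 bytes so that the only StringEncodings call used is the one quoted from the docstring.

```julia
using CSV, StringEncodings

# Build a small ISO-8859-1 (Latin-1) sample file. In Latin-1 every code point
# up to U+00FF is stored as the single byte with the same value, so each Char
# can be converted to a UInt8 directly.
path = joinpath(mktempdir(), "latin1_sample.csv")  # hypothetical file name
write(path, UInt8.(collect("name,city\nJosé,Málaga\n")))

# Decode the whole file to UTF-8 bytes up front, then parse those bytes.
# CSV.File receives a plain Vector{UInt8}, so it never has to call
# readavailable on the StringDecoder.
utf8_bytes = open(read, path, enc"ISO-8859-1")
file = CSV.File(utf8_bytes)

for row in file
    println(row.name, " / ", row.city)
end
```

For the file in the original report, the same pattern would presumably apply with the encoding from the failing call (enc"UCS-2LE") in place of enc"ISO-8859-1". The whole file is decoded into memory first, which is fine for small files but may matter for very large ones.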
I've implemented a fix at https://github.com/JuliaStrings/StringEncodings.jl/pull/53.
Though @quinnj do you think CSV.jl could use a more efficient approach? My implementation of readavailable returns copies of the data (by batches of 200 bytes), as the function is documented to return a Vector{UInt8} (otherwise I would return a SubArray). But that copy is only used by the write(::IO, ::IO) fallback to write its contents to another stream so it's really wasteful. On the contrary, StringDecoder has an efficient implementation of readbytes!, which isn't used here. Do you know a better API buffer_to_tempfile could use? Or maybe that's a problem in Julia and the write(::IO, ::IO) fallback should be improved, or maybe even readavailable changed to support returning views?
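To make the question concrete, here is a rough sketch of the kind of copy loop that would use readbytes! with a single reusable buffer instead of going through readavailable. It uses only Base functions; `copy_stream!` is a hypothetical name and this is not the actual code of buffer_to_tempfile or of the write(::IO, ::IO) fallback.

```julia
# Hypothetical helper: copy everything from `src` to `dest` through one
# preallocated buffer, so no fresh Vector{UInt8} is allocated per chunk.
function copy_stream!(dest::IO, src::IO; bufsize::Integer = 8192)
    buf = Vector{UInt8}(undef, bufsize)
    total = 0
    while !eof(src)
        n = readbytes!(src, buf, bufsize)  # fills buf, returns the byte count
        write(dest, view(buf, 1:n))        # write only the bytes actually read
        total += n
    end
    return total
end

# Demo with in-memory streams standing in for the decoder and the temp file.
data = rand(UInt8, 20_000)
src = IOBuffer(data)
dest = IOBuffer()
copied = copy_stream!(dest, src)
println("copied ", copied, " bytes")
```

Any IO type with an efficient readbytes! method, which StringDecoder is said to have above, would benefit from this shape because each chunk lands in the same buffer.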
I'm releasing a new StringEncodings version to fix this, but it would still be interesting to check whether a more efficient solution could be implemented @quinnj.
Bump @quinnj.
Ah, sorry for the slow response. Yeah, @Drvi and I have been jamming on some bigger-picture refactorings that will eventually make their way here, and yes, in the new model, we're using readbytes! instead.