DelimitedFiles.jl

readdlm not working with white spaces

Open cdelv opened this issue 3 years ago • 7 comments

When no delimiter is specified, readdlm by default treats any run of whitespace as a single separator. However, to specify the data type to be read, it seems one has to specify the delimiter too, as in this method:

readdlm(source, delim::AbstractChar, T::Type, eol::AbstractChar; header=false, skipstart=0, skipblanks=true, use_mmap, quotes=true, dims, comments=false, comment_char='#')

Then, in the following case,

readdlm(file, ' ', Float64, comments=true)

the function doesn't skip the leading whitespace, because the delimiter is ' ', exactly one space. The call then fails on input such as

 2 3
1 3

There should be a flag to ignore all repeated characters that match the delimiter, or it should be possible to specify just the type, like this:

readdlm(file, type=Float64, comments=true)

However, the second option leaves the problem unsolved when the delimiter is not whitespace.

cdelv avatar Apr 03 '22 18:04 cdelv

Well it shouldn't crash... but at least there's a workaround (especially useful for larger files): https://github.com/JuliaData/CSV.jl

PallHaraldsson avatar Apr 04 '22 09:04 PallHaraldsson

Can you provide an example file and invocation that exhibits this crash?

StefanKarpinski avatar Apr 05 '22 15:04 StefanKarpinski

@StefanKarpinski Sure,

The file content is

$ cat test.txt 
 1 2
 3 4

Note that the first character of each line is a space. Using this invocation

using DelimitedFiles

file="test.txt"
data=readdlm(file, ' ', Float64,)

I get the following error

at row 1, column 1 : ErrorException("file entry \"\" cannot be converted to Float64")

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] dlm_fill(::DataType, ::Array{Array{Int64,1},1}, ::Tuple{Int64,Int64}, ::Bool, ::String, ::Bool, ::Char) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:514
 [3] readdlm_string(::String, ::Char, ::Type{T} where T, ::Char, ::Bool, ::Dict{Symbol,Union{Char, Integer, Tuple{Integer,Integer}}}) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:470
 [4] readdlm_auto(::String, ::Char, ::Type{T} where T, ::Char, ::Bool; opts::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:244
 [5] readdlm_auto at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:233 [inlined]
 [6] #readdlm#6 at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:225 [inlined]
 [7] readdlm at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:225 [inlined]
 [8] #readdlm#2 at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:86 [inlined]
 [9] readdlm(::String, ::Char, ::Type{T} where T) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:86
 [10] top-level scope at In[4]:2
 [11] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091

This invocation produces something I didn't expect

data=readdlm(file, ' ')
2×3 Array{Any,2}:
 ""  1  2
 ""  3  4

And the one that works is

data=readdlm(file)
2×2 Array{Float64,2}:
 1.0  2.0
 3.0  4.0

This is a minimal working example, but it seems to happen in more complicated cases too.
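For reference, the DelimitedFiles documentation also lists a method that takes the element type without a delimiter, `readdlm(source, T::Type; options...)`, which keeps the default whitespace splitting. A minimal sketch, assuming that method treats leading spaces the same way the untyped `readdlm(file)` call above does (the temporary file only serves to make the example self-contained):

```julia
using DelimitedFiles

# Recreate test.txt from above in a temporary file; each line starts with a space.
path, io = mktemp()
write(io, " 1 2\n 3 4\n")
close(io)

# No delimiter argument: columns are split on runs of whitespace,
# while the element type is still fixed to Float64.
data = readdlm(path, Float64)
```

This only helps when the separator is whitespace; it does not address repeated non-whitespace delimiters.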

cdelv avatar Apr 05 '22 16:04 cdelv

That's not what we would call a "crash"; it's an error message indicating that the file doesn't have valid formatting. The error message isn't great (it reads an empty field before the leading space and then cannot convert that to float), but readdlm is also effectively deprecated and the CSV package should be used.
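The empty field can be reproduced without readdlm. The following is only an analogy using Base's `split`, not the code path readdlm itself takes: splitting on a single space keeps the empty string that precedes the leading space, whereas whitespace splitting drops it.

```julia
line = " 1 2"

# Splitting on exactly one space character keeps an empty first field.
with_single_space = split(line, ' ')

# Splitting on whitespace (no delimiter given) skips leading and repeated spaces.
with_whitespace = split(line)

# The empty field is the entry that cannot be converted to Float64.
first_field = tryparse(Float64, with_single_space[1])
```

Here `with_single_space` has three fields and `with_whitespace` has two, which matches the 2×3 and 2×2 results reported above.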

StefanKarpinski avatar Apr 05 '22 16:04 StefanKarpinski

I guess it needs an option to skip empty fields?

JeffBezanson avatar Apr 06 '22 17:04 JeffBezanson

One option that would be very helpful is to skip empty fields, as @JeffBezanson said, or, which I think is better, to treat multiple consecutive delimiter characters as a single separator. Some space index files have this kind of format:

  1997   8   2450457.0  73.8  75.8  78.7  78.7  74.0  73.0  67.9  68.8  1B11
  1997   9   2450458.0  73.7  75.6  79.9  78.7  74.7  72.9  66.9  68.6  1B11
  1997  10   2450459.0  75.4  75.4  80.3  78.7  76.3  72.8  70.5  68.3  1B11

DelimitedFiles.jl does not seem to be capable of parsing it correctly. However, if there were an option like skip_multiple_delims or something similar, it could! If you accept this proposal, I can submit a PR!
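Until such an option exists, the behavior can be approximated by hand. A rough sketch, not part of DelimitedFiles: the helper name `read_collapsed` is made up for illustration, and it relies on Base's `split` with `keepempty=false` to treat runs of the delimiter (including leading ones) as one separator. It assumes the input is non-empty and that every row has the same number of fields.

```julia
# Hypothetical helper for illustration; this is not a DelimitedFiles API.
function read_collapsed(io::IO, delim::AbstractChar)
    rows = Vector{Vector{SubString{String}}}()
    for line in eachline(io)
        # keepempty=false drops the empty fields produced by repeated delimiters.
        fields = split(line, delim; keepempty=false)
        isempty(fields) && continue  # skip blank lines
        push!(rows, fields)
    end
    # Each row becomes a column under hcat, so transpose back to rows.
    return permutedims(reduce(hcat, rows))
end

# Two lines in the format shown above, with leading and repeated spaces.
sample = "  1997   8   2450457.0  73.8  75.8  78.7  78.7  74.0  73.0  67.9  68.8  1B11\n" *
         "  1997   9   2450458.0  73.7  75.6  79.9  78.7  74.7  72.9  66.9  68.6  1B11\n"

table = read_collapsed(IOBuffer(sample), ' ')

# Fields stay as strings; convert the numeric columns explicitly.
julian_day = parse.(Float64, table[:, 3])
```

Keeping the fields as strings sidesteps the question of how a mixed column such as `1B11` should be typed; a built-in option would presumably reuse readdlm's existing type handling instead.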

ronisbr avatar Apr 14 '23 14:04 ronisbr

I didn't check, but if this works with CSV.jl and/or (the maybe less known) DLMReader.jl, then maybe it isn't worth implementing this (and instead document both, or at least whichever of them handles this case)? They at least load fast now:

julia> @time using CSV
  0.698591 seconds (691.91 k allocations: 45.800 MiB, 14.11% gc time, 2.74% compilation time)

julia> @time using DLMReader
┌ Warning: Julia started with single thread, to enable multithreaded functionalities in InMemoryDatasets.jl start Julia with multiple threads.
└ @ InMemoryDatasets ~/.julia/packages/InMemoryDatasets/60HVD/src/InMemoryDatasets.jl:205
  1.905116 seconds (2.06 M allocations: 131.414 MiB, 8.09% gc time, 12.93% compilation time: 88% of which was recompilation)


Slightly faster with:

$ julia -t auto

julia> @time using DLMReader
  1.685144 seconds (1.90 M allocations: 120.533 MiB, 7.65% gc time, 1.26% compilation time)

CSV.jl took slightly longer with auto though. Maybe a fluke.

PallHaraldsson avatar Apr 26 '23 14:04 PallHaraldsson