readdlm not working with white spaces
When no delimiter is specified, readdlm treats any run of whitespace as a single separator by default. However, specifying the data type to be read seems to require specifying the delimiter too:
readdlm(source, delim::AbstractChar, T::Type, eol::AbstractChar; header=false, skipstart=0, skipblanks=true, use_mmap, quotes=true, dims, comments=false, comment_char='#')
Then, in the following case,
readdlm(file, ' ', Float64, comments=true)
the function doesn't skip the leading whitespace, because the delimiter is exactly one ' ' character. The call then fails on input such as
 2 3
 1 3
There should be a flag to collapse consecutive characters that match the delimiter, or a way to specify only the type, like this:
readdlm(file, type=Float64, comments=true)
However, the second option has the drawback that the problem persists when the delimiter is not whitespace.
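For reference, DelimitedFiles also documents a positional method, readdlm(source, T::Type; options...), which takes the element type without a delimiter and splits columns on runs of whitespace. A minimal sketch, assuming an in-memory IOBuffer stands in for the file (the sample contents are illustrative):

```julia
using DelimitedFiles

# Illustrative input: each line starts with a space, as in the failing case.
io = IOBuffer(" 2 3\n 1 3\n")

# No delimiter argument: columns are split on one or more whitespace
# characters, and every field is parsed as Float64.
data = readdlm(io, Float64)
```

This only helps when the separator is whitespace; it does not address repeated non-whitespace delimiters.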
Well it shouldn't crash... but at least there's a workaround (especially useful for larger files): https://github.com/JuliaData/CSV.jl
Can you provide an example file and invocation that exhibits this crash?
@StefanKarpinski Sure,
The file content is
$ cat test.txt
 1 2
 3 4
Note that the first character of each line is a space. Using this invocation
using DelimitedFiles
file = "test.txt"
data = readdlm(file, ' ', Float64)
I get the following error
at row 1, column 1 : ErrorException("file entry \"\" cannot be converted to Float64")
Stacktrace:
[1] error(::String) at ./error.jl:33
[2] dlm_fill(::DataType, ::Array{Array{Int64,1},1}, ::Tuple{Int64,Int64}, ::Bool, ::String, ::Bool, ::Char) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:514
[3] readdlm_string(::String, ::Char, ::Type{T} where T, ::Char, ::Bool, ::Dict{Symbol,Union{Char, Integer, Tuple{Integer,Integer}}}) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:470
[4] readdlm_auto(::String, ::Char, ::Type{T} where T, ::Char, ::Bool; opts::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:244
[5] readdlm_auto at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:233 [inlined]
[6] #readdlm#6 at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:225 [inlined]
[7] readdlm at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:225 [inlined]
[8] #readdlm#2 at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:86 [inlined]
[9] readdlm(::String, ::Char, ::Type{T} where T) at /build/julia-k44EjI/julia-1.5.3+dfsg/usr/share/julia/stdlib/v1.5/DelimitedFiles/src/DelimitedFiles.jl:86
[10] top-level scope at In[4]:2
[11] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
This invocation gives a result I didn't expect:
data=readdlm(file, ' ')
2×3 Array{Any,2}:
"" 1 2
"" 3 4
And the one that works is
data=readdlm(file)
2×2 Array{Float64,2}:
1.0 2.0
3.0 4.0
This is a minimal working example, but it seems to happen in more complicated cases too.
That's not what we would call a "crash", it's an error message indicating that the file doesn't have valid formatting. The error message isn't great (it reads an empty field before the leading space and then cannot convert that to float), but readdlm is also effectively deprecated and the CSV package should be used.
I guess it needs an option to skip empty fields?
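One way such an option could behave is sketched below. read_skip_empty is a hypothetical helper, not part of DelimitedFiles; it assumes every non-blank line has the same number of fields and that each field parses as T.

```julia
# Hypothetical helper (not a DelimitedFiles API): split each line on `delim`,
# drop empty fields, and parse the remaining ones as `T`.
function read_skip_empty(io::IO, delim::AbstractChar, ::Type{T}) where {T}
    rows = Vector{Vector{T}}()
    for line in eachline(io)
        # keepempty=false discards the "" fields produced by leading,
        # trailing, or repeated delimiters.
        fields = split(line, delim; keepempty=false)
        isempty(fields) && continue      # skip blank lines
        push!(rows, parse.(T, fields))
    end
    isempty(rows) && return Matrix{T}(undef, 0, 0)
    # Each parsed row becomes a column under hcat, so transpose back.
    return permutedims(reduce(hcat, rows))
end

# The file from the report, with a leading space on each line.
data = read_skip_empty(IOBuffer(" 1 2\n 3 4\n"), ' ', Float64)
```

Unlike readdlm, this sketch does not handle quoting, comments, or headers, and it throws if the rows have different lengths.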
One option that would be very helpful is to skip empty fields, as @JeffBezanson said, or, which I think is better, to treat multiple consecutive delimiter characters as a single separator. Some space index files have this kind of format:
1997 8 2450457.0 73.8 75.8 78.7 78.7 74.0 73.0 67.9 68.8 1B11
1997 9 2450458.0 73.7 75.6 79.9 78.7 74.7 72.9 66.9 68.6 1B11
1997 10 2450459.0 75.4 75.4 80.3 78.7 76.3 72.8 70.5 68.3 1B11
DelimitedFiles.jl doesn't seem to be capable of parsing it correctly. However, with an option like skip_multiple_delims, it could! If you accept this proposal, I can submit a PR!
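For comparison, the default whitespace mode (no delimiter argument) already treats a run of spaces as one separator. A small sketch, assuming the columns in the original file are padded with extra spaces (the exact spacing below is illustrative, not taken from the file):

```julia
using DelimitedFiles

# Illustrative sample: the same values as above, with assumed space padding.
sample = """
1997  8 2450457.0  73.8  75.8  78.7  78.7  74.0  73.0  67.9  68.8 1B11
1997  9 2450458.0  73.7  75.6  79.9  78.7  74.7  72.9  66.9  68.6 1B11
1997 10 2450459.0  75.4  75.4  80.3  78.7  76.3  72.8  70.5  68.3 1B11
"""

# Without a delimiter, readdlm splits on one or more whitespace characters.
# The last column is not numeric, so the result is a heterogeneous array.
data = readdlm(IOBuffer(sample))
size(data)   # one row per line, one column per field
```

This does not cover files with a repeated non-whitespace delimiter, where an option like the proposed one would still be needed.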
I didn't check, but if this works with CSV.jl and/or the (maybe less well known) DLMReader.jl, then maybe there is no need to implement this, and the docs could instead point to whichever of those packages handles it? At least they load fast now:
julia> @time using CSV
0.698591 seconds (691.91 k allocations: 45.800 MiB, 14.11% gc time, 2.74% compilation time)
julia> @time using DLMReader
┌ Warning: Julia started with single thread, to enable multithreaded functionalities in InMemoryDatasets.jl start Julia with multiple threads.
└ @ InMemoryDatasets ~/.julia/packages/InMemoryDatasets/60HVD/src/InMemoryDatasets.jl:205
1.905116 seconds (2.06 M allocations: 131.414 MiB, 8.09% gc time, 12.93% compilation time: 88% of which was recompilation)
Slightly faster with:
$ julia -t auto
julia> @time using DLMReader
1.685144 seconds (1.90 M allocations: 120.533 MiB, 7.65% gc time, 1.26% compilation time)
CSV.jl took slightly longer with -t auto though. Maybe a fluke.