CSV.jl
Don't warn on dropped last column with trailing delimiter
When reading a file with e.g. 7 columns that ends each row on a delimiter, a warning is printed that only n-1/n columns were read, even when dropping that last empty column.
MWE
CSV:
test1;test2;test1;test2;test1;test2;test1;
test1;test2;test1;test2;test1;test2;test1;
test1;test2;test1;test2;test1;test2;test1;
test1;test2;test1;test2;test1;test2;test1;
test1;test2;test1;test2;test1;test2;test1;
julia> data = CSV.File("example.csv", delim=';', header=[:a, :b, :c,:d,:e,:f,:g], drop=[:g], strict=true)
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 1. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 2. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 3. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 4. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 5. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
5-element CSV.File{false}:
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")
In this example though, the g column actually corresponds to the last test1 values. So you're telling the parser to drop the test1 values; it then encounters the last ; delimiter and emits the warning that it expected to reach the end of the row but didn't, and that any remaining values on the row will be ignored. For example, if I do:
julia> f = CSV.File(IOBuffer(csv); delim=';', header=[:a, :b, :c, :d, :e, :f, :g, :h], drop=[:h], strict=true)
5-element CSV.File{false}:
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")
CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")
I don't see any warnings/errors.
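To see why a 7-name header triggers the warning while an 8-name header doesn't, it can help to count the fields in one row by hand. This is a minimal sketch using only Base Julia's split, not CSV.jl's actual parser; the row is copied from the MWE above.

```julia
# One row from the MWE: seven values, each followed by a ';'
row = "test1;test2;test1;test2;test1;test2;test1;"

# split keeps empty fields by default, so the trailing ';' yields an empty 8th field
fields = split(row, ';')

println(length(fields))        # number of fields, counting the empty trailing one
println(isempty(fields[end]))  # whether the last field is empty
```

With eight fields per row, eight header names (the last one dropped) line up with the data, whereas seven names leave the empty trailing field unaccounted for.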
Ah, right - that's weird 🤔 I definitely get this warning on my real data though (sadly can't share it :/)
julia> f = readlines("meinelba_utf-8.csv");
julia> map(x -> count(==(';'), x), f) |> unique
1-element Array{Int64,1}:
6
That should mean that there's a bug somewhere, right?
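The delimiter count above can be turned into the number of header names each row needs. This is a rough sketch, assuming every line ends with the delimiter; the lines vector is made-up stand-in data, not the real file.

```julia
# Stand-in data with 6 delimiters per line (hypothetical, mirrors the count above)
lines = ["a;b;c;d;e;f;", "1;2;3;4;5;6;"]

delims = unique(count(==(';'), l) for l in lines)  # distinct delimiter counts per line
trailing = all(endswith(l, ";") for l in lines)    # does every line end on ';'?

# n delimiters separate n + 1 fields; with a trailing delimiter the last one is empty
nfields = only(delims) + 1
nvalues = trailing ? nfields - 1 : nfields
```

Under that assumption, 6 delimiters mean 7 fields, of which 6 carry values, so a 7-name header with the last name dropped would cover the whole row.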
And if I ignore the last column completely, I get this warning for the first 101 rows:
julia> data = CSV.File("meinelba_utf-8.csv", delim=';', decimal=',', strict=true, header=[:Buchungsdatum, :Buchungstext, :Valutadatum, :Betrag, :Währung, :Durchführungszeitpunkt])
┌ Warning: thread = 1 warning: parsed expected 6 columns, but didn't reach end of line around data row: 1. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
I suspect the warning for the other rows also exists, but isn't printed since they're handled by a different thread?
julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
JULIA_PKG_SERVER =
JULIA_NUM_THREADS = 4
I too have data that has a trailing delimiter, and want to drop the last column. I generated some data in the file "Book1.csv" as follows:
1,1,1,
2,4,8,
3,9,27,
I can read this in with the following command
a = CSV.File("Book1.csv", drop=[4])
I would like to make this more generic by using
a = CSV.File("Book1.csv", drop=[end])
Is it possible to add this capability to the drop and select keyword arguments in CSV? Currently my workaround is to generate a DataFrame and eliminate the last column there.
In addition drop=[1,2,3] works but drop=[1:3] does not work.
Note that drop=[1:3] creates a Vector{UnitRange}, i.e. it doesn't expand the range. Either drop=collect(1:3) or drop=1:3 would work.
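The difference is visible in plain Julia, independent of CSV.jl. A small sketch comparing the three spellings:

```julia
wrapped  = [1:3]         # a one-element vector whose only element is the range 1:3
expanded = collect(1:3)  # a Vector{Int} holding 1, 2, 3
range    = 1:3           # the range itself, already an AbstractVector of Ints

println(length(wrapped))           # 1
println(eltype(wrapped))           # a UnitRange type, not an integer type
println(expanded == [1, 2, 3])     # true
println(range isa AbstractVector)  # true
```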
Hmmm, I do see that it would be useful to support something like drop=[end]; the question is how to make that work. Maybe drop=[-1], where each negative number would count from the end of the array?
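As a sketch of what the negative-index idea could look like, the helper below maps negative entries to positions counted from the last column. The name resolve_drop and its signature are hypothetical and not part of CSV.jl; it also assumes the total number of columns is already known.

```julia
# Hypothetical helper (not part of CSV.jl): map negative indices to positions
# counted from the last column, so -1 means the last of `ncols` columns.
resolve_drop(idxs, ncols) = [i < 0 ? ncols + 1 + i : i for i in idxs]

last_only = resolve_drop([-1], 4)     # [4]
mixed     = resolve_drop([1, -2], 4)  # [1, 3]
```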