CSV.jl

Don't warn on dropped last column with trailing delimiter

Open Seelengrab opened this issue 4 years ago • 6 comments

When reading a file with e.g. 7 columns in which every row ends with a delimiter, a warning is printed that only n-1 of n columns were read, even when that last empty column is dropped.

MWE

CSV:

test1;test2;test1;test2;test1;test2;test1;
test1;test2;test1;test2;test1;test2;test1;
test1;test2;test1;test2;test1;test2;test1;
test1;test2;test1;test2;test1;test2;test1;
test1;test2;test1;test2;test1;test2;test1;
julia> data = CSV.File("example.csv", delim=';', header=[:a, :b, :c,:d,:e,:f,:g], drop=[:g], strict=true)
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 1. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 2. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 3. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 4. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
┌ Warning: thread = 1 warning: parsed expected 7 columns, but didn't reach end of line around data row: 5. Ignoring any extra columns on this row
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604
5-element CSV.File{false}:
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2")

Seelengrab, Dec 16 '20 18:12

In this example though, the g column actually corresponds to the last test1 values. So you're telling the parser to drop the test1 values; it then encounters the last ; delimiter and emits the warning that it expected to reach the end of the row but didn't, and that any remaining values on the row will be ignored. For example, if I do:

julia> f = CSV.File(IOBuffer(csv); delim=';', header=[:a, :b, :c, :d, :e, :f, :g, :h], drop=[:h], strict=true)
5-element CSV.File{false}:
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")
 CSV.Row: (a = "test1", b = "test2", c = "test1", d = "test2", e = "test1", f = "test2", g = "test1")

I don't see any warnings/errors.
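
The field count behind this explanation can be checked with Base Julia alone. This is a minimal sketch that assumes the MWE row above and uses plain split rather than CSV.jl's parser: seven ; delimiters yield eight fields, the last one empty, so seven header names leave one field unread.

```julia
# One row from the MWE above; note the trailing delimiter.
line = "test1;test2;test1;test2;test1;test2;test1;"

ndelims = count(==(';'), line)   # number of ';' characters in the row
fields = split(line, ';')        # split keeps empty fields by default

println(ndelims)                 # 7
println(length(fields))          # 8
println(isempty(last(fields)))   # true
```

With eight header names, the empty eighth field gets a name of its own and can be dropped without the parser stopping short of the end of the line.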

quinnj, Dec 16 '20 22:12

Ah, right - that's weird 🤔 I definitely get this warning on my real data though (sadly can't share it :/)

Seelengrab, Dec 17 '20 06:12


julia> f = readlines("meinelba_utf-8.csv");

julia> map(x -> count(==(';'), x), f) |> unique
1-element Array{Int64,1}:
 6

That should mean that there's a bug somewhere, right?

Seelengrab, Dec 17 '20 07:12

And if I ignore the last column completely, I get this warning for the first 101 rows:

julia> data = CSV.File("meinelba_utf-8.csv", delim=';', decimal=',', strict=true, header=[:Buchungsdatum, :Buchungstext, :Valutadatum, :Betrag, :Währung, :Durchführungszeitpunkt])
┌ Warning: thread = 1 warning: parsed expected 6 columns, but didn't reach end of line around data row: 1. Ignoring any extra columns on this row                                 
└ @ CSV ~/.julia/packages/CSV/la2cd/src/file.jl:604                                                                                                                               

I suspect the warning for the other rows also exists, but isn't printed since they're handled by a different thread?

julia> versioninfo()
Julia Version 1.5.3
Commit 788b2c77c1 (2020-11-09 13:37 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6600U CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PKG_SERVER =
  JULIA_NUM_THREADS = 4

Seelengrab, Dec 17 '20 07:12

I too have data that has a trailing delimiter, and want to drop the last column. I generated some data in the file "Book1.csv" as follows:

1,1,1,
2,4,8,
3,9,27,

I can read this in with the following command

a = CSV.File("Book1.csv", drop=[4])

I would like to make this more generic by using

a = CSV.File("Book1.csv", drop=[end])

Is it possible to add this capability to the drop and select keywords in CSV? Currently my workaround is to generate a DataFrame and eliminate the last column there.

In addition drop=[1,2,3] works but drop=[1:3] does not work.

JakeZw, Dec 18 '20 02:12

In addition drop=[1,2,3] works but drop=[1:3] does not work.

Note that drop=[1:3] creates a Vector{UnitRange}, i.e. it doesn't expand the range. Either drop=collect(1:3) or drop=1:3 would work.
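
The difference shows up in Base Julia without CSV.jl involved. A small sketch (the variable names are only for illustration):

```julia
wrapped = [1:3]            # one-element vector holding the range itself
expanded = collect(1:3)    # the range expanded into its elements

println(wrapped isa Vector{UnitRange{Int}})  # true
println(length(wrapped))                     # 1
println(expanded == [1, 2, 3])               # true
```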

Hmmm, I do see that it would be useful to support something like drop=[end]; the question is how to make that work. Maybe drop=[-1], where each negative number would count from the end of the array?
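
One way the negative-index idea could be resolved is sketched below. This is only an illustration: resolve_drop is a hypothetical helper, not part of CSV.jl, and it assumes the number of columns is already known (for example after header detection).

```julia
# Hypothetical helper, not part of CSV.jl: map negative indices to
# positions counted from the last column (-1 is the last column).
function resolve_drop(indices::AbstractVector{<:Integer}, ncols::Integer)
    resolved = Int[]
    for i in indices
        j = i < 0 ? ncols + i + 1 : i
        1 <= j <= ncols ||
            throw(ArgumentError("column index $i is out of range for $ncols columns"))
        push!(resolved, j)
    end
    return resolved
end

println(resolve_drop([-1], 4))      # [4]
println(resolve_drop([1, -2], 4))   # [1, 3]
```

For the 4-column Book1.csv example above, drop=[-1] would then resolve to the same column as drop=[4].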

quinnj, Aug 20 '21 06:08