CSV.jl

CSV.read not working properly with StructArray

Open robsmith11 opened this issue 3 years ago • 5 comments

It works as expected when reading into a NamedTuple first, but not when reading directly into a StructArray:

julia> write("/tmp/a.csv", "a,b,c\n1,2,3\n4,5,6\n");

julia> CSV.read("/tmp/a.csv", StructArrays.StructArray)
0-element StructArray() with eltype Vector{Int64} with indices 1:0

julia> StructArrays.StructArray(CSV.read("/tmp/a.csv", NamedTuple))
2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}) with eltype NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}:
 (a = 1, b = 2, c = 3)
 (a = 4, b = 5, c = 6)

julia> Pkg.status("CSV")
Status `/me/.julia/environments/v1.8/Project.toml`
  [336ed68f] CSV v0.9.10

robsmith11 avatar Nov 13 '21 02:11 robsmith11

Pinging @piever to chime in, but as I understand it, a StructArray is a valid Tables.jl source but doesn't currently have a sink interface. That is, the generic fallback StructArray(x) definition expects x to be an iterator of structs, not necessarily a Tables.jl source.

My guess is it would be trivial to add some kind of StructArrays.fromtable function that would take any valid Tables.jl source and create a StructArray, if you wanted to take a stab at a PR.
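A rough sketch of what such a helper might look like, assuming it simply goes through Tables.columntable. The name fromtable_sketch is hypothetical and is not part of StructArrays; the input here is a vector of named tuples, which Tables.jl treats as a row table.

```julia
using StructArrays, Tables

# Hypothetical helper (not an existing StructArrays function):
# convert any Tables.jl source to a named tuple of column vectors,
# then wrap those columns in a StructArray.
fromtable_sketch(table) = StructArray(Tables.columntable(table))

# A Vector of NamedTuples is a valid Tables.jl row table.
rows = [(a = 1, b = 2), (a = 4, b = 5)]

sa = fromtable_sketch(rows)

# Columns are available as properties, rows by indexing.
col_a = sa.a
first_row = sa[1]
```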

quinnj avatar Nov 13 '21 05:11 quinnj

Yes, that is correct. You can actually already do

CSV.read(fn, StructArray∘Tables.columntable)

which is probably close to optimal (the data structure backing a StructArray is a named tuple of vectors).
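For reference, here is that composed sink applied to the file from the original report, as a minimal sketch. The temporary path is only for illustration; the composition works because CSV.read passes the parsed file to the sink function, and Tables.columntable turns it into a named tuple of vectors that StructArray can wrap.

```julia
using CSV, StructArrays, Tables

# Write the same small file used in the report to a temporary location.
path = tempname() * ".csv"
write(path, "a,b,c\n1,2,3\n4,5,6\n")

# The sink is an ordinary function: first build a column table,
# then wrap the columns in a StructArray.
sa = CSV.read(path, StructArray ∘ Tables.columntable)

nrows = length(sa)   # number of data rows read
col_a = sa.a         # column access by property
first_row = sa[1]    # row access by index
```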

Should CSV.read call materializer(sink) rather than sink to process the data? That way, I can just define the correct

materializer(::Type{<:StructArray})

method, and CSV.read(fn, StructArray) would just work.

piever avatar Nov 15 '21 11:11 piever

To me, StructArray∘Tables.columntable seems like the way to go. CSV.read is pretty well documented that you need to pass a valid sink function; materializer is usually just for when you pass in an instance of a table and want to get its materializing function.

quinnj avatar Nov 16 '21 04:11 quinnj

I think most users who aren't familiar with the implementation details are still going to expect StructArray to work as a sink without going through an intermediate step. It's a table-like data structure just like DataFrame.

robsmith11 avatar Nov 16 '21 07:11 robsmith11

> CSV.read is pretty well documented that you need to pass a valid sink function;

Makes sense. I can probably still define materializer for a StructArray type, so that users can do

CSV.read(fn, materializer(StructArray))

Rather than defining CSV.read(fn, sink::Type) as CSV.File(fn) |> materializer(sink) (which may be a step too far), maybe the docs of CSV.read could briefly mention the materializer helper as a way to get a valid sink function from an instance or type of a table (I think both work).

piever avatar Nov 16 '21 09:11 piever