CSV.jl
CSV.jl copied to clipboard
CSV.read not working properly with StructArray
It works as expected when reading into a NamedTulpe first, but not when reading directly into a StructArray:
julia> write("/tmp/a.csv", "a,b,c\n1,2,3\n4,5,6\n");
julia> CSV.read("/tmp/a.csv", StructArrays.StructArray)
0-element StructArray() with eltype Vector{Int64} with indices 1:0
julia> StructArrays.StructArray(CSV.read("/tmp/a.csv", NamedTuple))
2-element StructArray(::Vector{Int64}, ::Vector{Int64}, ::Vector{Int64}) with eltype NamedTuple{(:a, :b, :c), Tuple{Int64, Int64, Int64}}:
(a = 1, b = 2, c = 3)
(a = 4, b = 5, c = 6)
julia> Pkg.status("CSV")
Status `/me/.julia/environments/v1.8/Project.toml`
[336ed68f] CSV v0.9.10
Pinging @piever to chime in, but as I understand it, a StructArray is a valid Tables.jl source, but doesn't currently have a sink interface. i.e. the generic fallback StructArray(x) definitions expects x to be a struct iterator, but not necessarily a Tables.jl-related source.
My guess is it would be trivial to add some kind of StructArrays.fromtable function that would take any valid Tables.jl source and create a StructArray, if you wanted to take a stab at a PR.
Yes, that is correct. You can actually already do
CSV.read(fn, StructArray∘Tables.columntable)
which is probably close to optimal (the data structure backing a StructArray is a named tuple of vectors).
Should CSV.read call materializer(sink) rather than sink to process the data? That way, I can just define the correct
materializer(::Type{<:StructArray})
method, and CSV.read(fn, StructArray) would just work.
It seems like StructArray∘Tables.columntable is the way to go to me. CSV.read is pretty well documented that you need to pass a valid sink function; materializer is usually just for when you pass in an instance of a table and want to get its materializing function.
I think most users who aren't familiar with the implementation details are still going to expect StructArray to work as a sink without going through an intermediate step. It's a table-like data structure just like DataFrame.
CSV.readis pretty well documented that you need to pass a valid sink function;
Makes sense. I can probably still define materializer for a StructArray type, so that users can do
CSV.read(fn, materializer(StructArray))
Rather than defining CSV.read(fn, sink::Type) as CSV.File(fn) |> materializer(sink) (it may be a step too far) maybe the docs of CSV.read could briefly mention the materializer helper to get a valid sink function from an instance or type (I think both work) of table.