Does Tables have the infrastructure for a performant map that returns vectors of tuples?

Open pdeffebach opened this issue 3 years ago • 2 comments

I frequently want to use map to make a table, and do

using Tables

t = map(1:1000) do _
    x = rand()
    y = rand()
    (; x, y)
end |> Tables.columntable # named tuple of vectors, not vector of named tuples

This seems wasteful: map first builds a vector of named tuples, which is then copied into columns. I would rather have some function mapv that returns a tuple of vectors instead of a vector of tuples, i.e. a function that inspects the first result, allocates vectors of the appropriate size (or does something else if the size is not known), and then writes each result into the vectors at the appropriate index.
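A minimal sketch of that idea, to make the steps concrete. `mapv` and `setrow!` are hypothetical names, not part of Tables.jl or Base. The sketch assumes the input is non-empty, has a known length, and that every call returns a NamedTuple with the same field names and value types convertible to those of the first result. It makes no performance claims.

```julia
# Hypothetical helper: write one row into the column vectors at index i.
# `map` over two NamedTuples requires matching field names, so a row with
# different fields raises an error instead of being written silently.
setrow!(cols, row, i) = (map((col, v) -> (col[i] = v), cols, row); nothing)

# Hypothetical `mapv`: like `map`, but returns a NamedTuple of vectors.
# Assumes `itr` has a known length and `f` returns NamedTuples of one shape.
function mapv(f, itr)
    n = length(itr)
    next = iterate(itr)
    next === nothing && throw(ArgumentError("mapv needs a non-empty input"))
    item, state = next
    row = f(item)
    # Inspect the first result and allocate one vector per field.
    cols = map(v -> Vector{typeof(v)}(undef, n), row)
    i = 1
    while true
        setrow!(cols, row, i)
        next = iterate(itr, state)
        next === nothing && break
        item, state = next
        row = f(item)
        i += 1
    end
    return cols
end

t = mapv(_ -> (x = rand(), y = rand()), 1:1000)
```

Here `t.x` and `t.y` are plain vectors of length 1000. Handling unknown sizes or widening element types when a later result does not fit is exactly the part this sketch leaves out.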

Writing a package for this sounds pretty hard, to be honest. There are likely lots of performance considerations. Has Tables.jl already solved these problems? If I were to write a package for this, could it be just a thin wrapper around the Tables API?

Thanks

pdeffebach avatar Apr 07 '21 16:04 pdeffebach

Doing

rows = ((x=rand(), y=rand()) for i = 1:1000)
@time columntable(rows)

is pretty efficient; in the majority of cases it avoids allocating the individual NamedTuples, because the compiler can figure out that you're just going to put the individual elements into the resulting vectors.

The ideal efficiency requires a schema, so doing something like the following would get there. It's a bit onerous to have to specify the schema, but maybe there'd be a way to make it a little easier somehow. I've kind of thought something like WithSchema might be generally helpful for cases when people know the schema but are dealing with a schema-less table for some reason.

struct WithSchema{S, T}
    sch::S # Tables.Schema
    source::T
end

Tables.isrowtable(::Type{<:WithSchema}) = true
Tables.rows(x::WithSchema) = x
Tables.schema(x::WithSchema) = x.sch
Base.iterate(x::WithSchema) = iterate(x.source)
Base.iterate(x::WithSchema, st) = iterate(x.source, st)
Base.IteratorSize(::Type{WithSchema{S, T}}) where {S, T} = Base.IteratorSize(T) # forward the source's size trait
Base.length(x::WithSchema) = length(x.source)

rows = ((x=rand(), y=rand()) for i = 1:1000)
ws = WithSchema(Tables.Schema((:x, :y), (Float64, Float64)), rows)
@time columntable(ws)
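One possible way to make the schema less onerous to specify is to derive it from a prototype row. The sketch below assumes this is acceptable for the use case; `schema_like` is a hypothetical helper, not part of Tables.jl. The wrapper is repeated so the snippet runs on its own.

```julia
using Tables

# Wrapper that attaches a known schema to a schema-less row iterator.
struct WithSchema{S, T}
    sch::S # Tables.Schema
    source::T
end

Tables.isrowtable(::Type{<:WithSchema}) = true
Tables.rows(x::WithSchema) = x
Tables.schema(x::WithSchema) = x.sch
Base.iterate(x::WithSchema) = iterate(x.source)
Base.iterate(x::WithSchema, st) = iterate(x.source, st)
Base.IteratorSize(::Type{WithSchema{S, T}}) where {S, T} = Base.IteratorSize(T)
Base.length(x::WithSchema) = length(x.source)

# Hypothetical helper: build a Tables.Schema from an example row.
# Only the field names and the types of the values are used.
schema_like(proto::NamedTuple) = Tables.Schema(keys(proto), map(typeof, values(proto)))

rows = ((x = rand(), y = rand()) for i in 1:1000)
ws = WithSchema(schema_like((x = 0.0, y = 0.0)), rows)
t = Tables.columntable(ws)
```

The prototype row has to match what the source actually produces; if the real rows carry other types, the declared schema is wrong and building the columns may fail or convert values.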

quinnj avatar Apr 07 '21 20:04 quinnj

Stumbled across this issue by chance, but if the question is still relevant: there is already a direct solution. Just use StructArrays.jl!

map(StructArray(i=1:1000)) do _
    x = rand()
    y = rand()
    (; x, y)
end

This returns a StructArray, which is basically a named tuple of two vectors. This is the most efficient way: only 2 allocations and barely any overhead.
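A short sketch of how the result can be used, assuming StructArrays.jl is installed. The columns are reachable by property access, and indexing assembles a row on demand.

```julia
using StructArrays

sa = map(StructArray(i = 1:1000)) do _
    x = rand()
    y = rand()
    (; x, y)
end

xs = sa.x      # the column holding every x value
ys = sa.y      # the column holding every y value
row = sa[1]    # a NamedTuple (x = ..., y = ...) assembled on access
```

Because the columns are stored separately, `xs` and `ys` can be passed on as ordinary vectors without copying.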

aplavin avatar Aug 03 '22 10:08 aplavin