creating a `DTable` eagerly copy all entries?
julia> struct A <: AbstractVector{Int}
v::Int
end
julia> Base.size(::A) = (100, 1)
julia> Base.getindex(a::A, ::Int) = (println("??"); a.v)
julia> using DTables
julia> dt = DTable((a=A(3),), 10);
??
??
??
??
but this seems to be fine?
julia> dt = DTable((a=A(3),));
julia>
Naive debugging during a talk :)
I think the critical line in DTables.jl is https://github.com/JuliaParallel/DTables.jl/blob/main/src/table/dtable.jl#L109. If I didn't make a debugging mistake, it jumps to https://github.com/JuliaData/Tables.jl/blob/175e431eadaae9e439b5a8e020afce06dc6cf4f4/src/namedtuples.jl#L187 and then [...] to https://github.com/JuliaData/Tables.jl/blob/175e431eadaae9e439b5a8e020afce06dc6cf4f4/src/fallbacks.jl#L265 which is
CopiedColumns(buildcolumns(schema(r), r))
When we do DTable((a=A(3),)); we have
typeof(partition) == NamedTuple{(:a,), Tuple{A}}
go into the sink call, which leads to https://github.com/JuliaData/Tables.jl/blob/main/src/namedtuples.jl#L192.
For DTable((a=A(3),), 10); OTOH we have
typeof(inner_partition) = TableOperations.ColumnPartitionRows{NamedTuple{(:a,), Tuple{A}}}
which leads to the fallback that I linked the post above.
~~I think it's reasonable for partition to allocate, closing as not actionable.~~
so DTable calls table operation, maybe it should not materialize like this?
The simple answer without going into details is that if you use the constructors that assume the partitioning of the input then it will not copy the input (at least it shouldn't)
When you use the constructors that take the chunksize arg then the constructor will be partitioning the data for you and ignore the partitioning of the input (fyi it will still use the partitions interface).
With these you can also control whether you want to merge rows between the partitions of the input, which might be helpful when your partitions are huge and you don't want two of them loaded at the same time
The above work on Tables/TableOperations, so they fallback to copies eventually even if the code doesn't really copy it explicitly (not aware if this can be improved)
I'm not super happy with the constructors for DTable. There's a lot of cool things we could do to make it more efficient when using different input sources. As of right now it's made to be universal and take any Tables.jl input Good news is that they're not very tightly coupled with the rest of the code and if you really need to you can write your own constructor pretty easily, so we can extend/replace this without any concerns