DTables.jl icon indicating copy to clipboard operation
DTables.jl copied to clipboard

creating a `DTable` eagerly copy all entries?

Open Moelf opened this issue 2 years ago • 6 comments

julia> struct A <: AbstractVector{Int}
           v::Int
       end
julia> Base.size(::A) = (100, 1)

julia> Base.getindex(a::A, ::Int) = (println("??"); a.v)

julia> using DTables

julia> dt = DTable((a=A(3),), 10);
??
??
??
??

Moelf avatar Nov 08 '23 14:11 Moelf

Came up during the Julia HEP 2023 workshop.

(cc @grasph)

carstenbauer avatar Nov 08 '23 14:11 carstenbauer

but this seems to be fine?

julia> dt = DTable((a=A(3),));

julia>

Moelf avatar Nov 08 '23 14:11 Moelf

Naive debugging during a talk :)

I think the critical line in DTables.jl is https://github.com/JuliaParallel/DTables.jl/blob/main/src/table/dtable.jl#L109. If I didn't make a debugging mistake, it jumps to https://github.com/JuliaData/Tables.jl/blob/175e431eadaae9e439b5a8e020afce06dc6cf4f4/src/namedtuples.jl#L187 and then [...] to https://github.com/JuliaData/Tables.jl/blob/175e431eadaae9e439b5a8e020afce06dc6cf4f4/src/fallbacks.jl#L265 which is

CopiedColumns(buildcolumns(schema(r), r))

carstenbauer avatar Nov 08 '23 15:11 carstenbauer

When we do DTable((a=A(3),)); we have

typeof(partition) == NamedTuple{(:a,), Tuple{A}}

go into the sink call, which leads to https://github.com/JuliaData/Tables.jl/blob/main/src/namedtuples.jl#L192.

For DTable((a=A(3),), 10); OTOH we have

typeof(inner_partition) = TableOperations.ColumnPartitionRows{NamedTuple{(:a,), Tuple{A}}}

which leads to the fallback that I linked the post above.

carstenbauer avatar Nov 08 '23 15:11 carstenbauer

~~I think it's reasonable for partition to allocate, closing as not actionable.~~

so DTable calls table operation, maybe it should not materialize like this?

Moelf avatar Nov 08 '23 16:11 Moelf

The simple answer without going into details is that if you use the constructors that assume the partitioning of the input then it will not copy the input (at least it shouldn't) When you use the constructors that take the chunksize arg then the constructor will be partitioning the data for you and ignore the partitioning of the input (fyi it will still use the partitions interface). With these you can also control whether you want to merge rows between the partitions of the input, which might be helpful when your partitions are huge and you don't want two of them loaded at the same time

The above work on Tables/TableOperations, so they fallback to copies eventually even if the code doesn't really copy it explicitly (not aware if this can be improved)

I'm not super happy with the constructors for DTable. There's a lot of cool things we could do to make it more efficient when using different input sources. As of right now it's made to be universal and take any Tables.jl input Good news is that they're not very tightly coupled with the rest of the code and if you really need to you can write your own constructor pretty easily, so we can extend/replace this without any concerns

krynju avatar Nov 08 '23 19:11 krynju