PrettyTables.jl icon indicating copy to clipboard operation
PrettyTables.jl copied to clipboard

Following the comment in discourse

Open sl-solution opened this issue 2 years ago • 8 comments

Thank you for supporting the development of InMemoryDatasets.jl.

I added a comment to your post in discourse (here), however, I thought it might be better to discuss it here.

Currently printing a large view of a data set is very slow, I wondering if this can be improved?

sl-solution avatar Mar 22 '22 08:03 sl-solution

Hi @sl-solution !

Can you please test against master?

ronisbr avatar Mar 22 '22 10:03 ronisbr

For scenarios like view(ds, :, :) it is great, however, my problem is about the situation when a view has a non consecutive permutation of rows, this happens for groupby and gatherby,

using Random, InMemoryDatasets
ds = Dataset(rand(1:1000, 10^7,3), :auto);
sds = view(ds, shuffle(1:nrow(ds)), :);
@time show(ds); # this is fast
@time show(sds); # 2.419944 seconds (5.29 k allocations: 283.562 KiB)

sl-solution avatar Mar 22 '22 11:03 sl-solution

Hi @sl-solution !

Sorry, I just have time to take a look at this issue now. After profiling, I see that almost the entire execution is inside this function:

https://github.com/sl-solution/InMemoryDatasets.jl/blob/009b9084b9f5263d3f2f8080e423fb6c9c17aac4/src/other/tables.jl#L34

According to Tables.jl API, I need to access an element using this function:

Tables.getcolumn(ctable.table, column_name)[I]

In your case, it calls view(_columns(parent(sds))[i], rows(sds)) which seems very slow:

julia> @btime Tables.getcolumn(sds, :x1)[1]
  4.688 ms (1 allocation: 48 bytes)
314

julia> @btime Tables.getcolumn(ds, :x1)[1]
  147.847 ns (2 allocations: 272 bytes)
403

ronisbr avatar Mar 23 '22 16:03 ronisbr

If that's the case, does the following help?

julia> _tmp = Tables.getcolumn(sds, :x1) # this is expensive but we can do it once
julia> _tmp[1]

sl-solution avatar Mar 24 '22 09:03 sl-solution

Actually no, because this function is called to get the element at each cell. Thus, I cannot store it temporally. Since PrettyTables.jl must work with many other types of data, including those that do no support Tables.jl, it will not be good to create functions to store columns at each different case.

ronisbr avatar Mar 24 '22 12:03 ronisbr

but here _tmp isn't the whole column it is like a pointer to the actual values.

sl-solution avatar Mar 24 '22 18:03 sl-solution

Yes, but PrettyTables will call:

_tmp = Tables.getcolumn(sds, :x1)
_tmp[1]

for each table element, leading to the same result. The only way to improve inside PrettyTables is to store a column and then get all the elements, instead of asking for a column at every cell. However, this is not feasible since I do not render only structures that supports Tables.jl API.

ronisbr avatar Mar 24 '22 21:03 ronisbr

I see, let me think more about this.

sl-solution avatar Mar 24 '22 21:03 sl-solution