PrettyTables.jl
PrettyTables.jl copied to clipboard
Following the comment in discourse
Thank you for supporting the development of InMemoryDatasets.jl
.
I added a comment to your post in discourse (here), however, I thought it might be better to discuss it here.
Currently printing a large view of a data set is very slow, I wondering if this can be improved?
Hi @sl-solution !
Can you please test against master?
For scenarios like view(ds, :, :)
it is great, however, my problem is about the situation when a view has a non consecutive permutation of rows, this happens for groupby
and gatherby
,
using Random, InMemoryDatasets
ds = Dataset(rand(1:1000, 10^7,3), :auto);
sds = view(ds, shuffle(1:nrow(ds)), :);
@time show(ds); # this is fast
@time show(sds); # 2.419944 seconds (5.29 k allocations: 283.562 KiB)
Hi @sl-solution !
Sorry, I just have time to take a look at this issue now. After profiling, I see that almost the entire execution is inside this function:
https://github.com/sl-solution/InMemoryDatasets.jl/blob/009b9084b9f5263d3f2f8080e423fb6c9c17aac4/src/other/tables.jl#L34
According to Tables.jl API, I need to access an element using this function:
Tables.getcolumn(ctable.table, column_name)[I]
In your case, it calls view(_columns(parent(sds))[i], rows(sds))
which seems very slow:
julia> @btime Tables.getcolumn(sds, :x1)[1]
4.688 ms (1 allocation: 48 bytes)
314
julia> @btime Tables.getcolumn(ds, :x1)[1]
147.847 ns (2 allocations: 272 bytes)
403
If that's the case, does the following help?
julia> _tmp = Tables.getcolumn(sds, :x1) # this is expensive but we can do it once
julia> _tmp[1]
Actually no, because this function is called to get the element at each cell. Thus, I cannot store it temporally. Since PrettyTables.jl must work with many other types of data, including those that do no support Tables.jl, it will not be good to create functions to store columns at each different case.
but here _tmp isn't the whole column it is like a pointer to the actual values.
Yes, but PrettyTables will call:
_tmp = Tables.getcolumn(sds, :x1)
_tmp[1]
for each table element, leading to the same result. The only way to improve inside PrettyTables is to store a column and then get all the elements, instead of asking for a column at every cell. However, this is not feasible since I do not render only structures that supports Tables.jl API.
I see, let me think more about this.