perf: better scalability of get_columns
# Description
Use hashset for existence checking. Still needs a vector collection to keep the column order for tables.
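For context, here is a minimal Rust sketch of that approach (simplified types and names, not the actual nu-protocol code): the HashSet handles the existence check, while the Vec preserves the first-seen column order.

```rust
use std::collections::HashSet;

// Rough sketch (not the actual diff): collect the column names of a list of
// "records", using a HashSet for O(1) membership checks while a Vec keeps the
// first-seen column order.
fn get_columns_sketch(rows: &[Vec<(String, String)>]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut columns: Vec<String> = Vec::new();
    for row in rows {
        for (name, _value) in row {
            // Previously this membership test was a linear scan over `columns`,
            // i.e. O(C) per key and O(R * C^2) overall.
            if seen.insert(name.as_str()) {
                columns.push(name.clone());
            }
        }
    }
    columns
}
```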
# User-Facing Changes
Should be None
A perf tag without a benchmark or scaling demo is wishful thinking ;) Where is the point at which the overhead of the hash set is paid off by the faster access?
@blindFS Is it possible to augment our benchmarks to demonstrate this speed improvement?
I won't call it a benchmark, but here's a simple demo:
```nushell
mut ans = {}
for i in ..100 {$ans = $ans | merge {$i: a}}
let a = $ans
timeit {..999999 | each {$a} | columns }
```
The last `timeit` command improves from ~1min 3sec to ~16.5sec when both builds are unoptimized debug targets, and from ~11sec to ~1.2sec with optimized ones.
Obviously, even with this patch, this functionality still needs some serious rework, like some kind of metadata fields. But I'm afraid that eventually falls into the table vs. `list<record>` discussion...
Thanks @blindFS. Let's give it a chance.
Thanks for the benchmark example! Those are definitely substantial improvements for large data. I was playing around to see whether, for more "realistic" column counts, the in-theory-cheaper linear ops still matter. I tried low column numbers (2, 5, 10, 20) but still high row counts (10^4, 10^5, 10^6), and couldn't see a relevant benefit to special-casing low column counts with a basic `timeit` measurement.
It certainly doesn't help for smaller column counts: the change is O(RC^2) -> O(RC log C).
I wanted to make sure we don't dramatically regress due to the time it takes in practice to set up a HashSet, plus its small constant overhead, compared to an immediate hit in a short linear search (we still use Vecs for `Record` for the reason that on most data it is pretty dang fast, even if the complexity in the limit is horrible). A rough sketch of that trade-off is below.
In a heterogeneous table scenario (with different keys across the rows), the set would also grow with the number of rows, even though it should saturate pretty quickly for most real-world data.
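For illustration, a toy Rust comparison of the two membership checks (made-up helper names, not Nushell code): the Vec scan is O(C) per lookup but essentially free to set up, while the HashSet is O(1) per lookup but pays hashing and allocation costs up front.

```rust
use std::collections::HashSet;

// Linear membership check: O(C) per lookup, but no setup cost and good cache
// behaviour, which is hard to beat when C is only a handful of columns.
fn contains_linear(cols: &[String], name: &str) -> bool {
    cols.iter().any(|c| c == name)
}

// Hashed membership check: O(1) amortized per lookup, but the set has to be
// allocated and every key hashed before the faster lookups start paying off.
fn contains_hashed(cols: &HashSet<String>, name: &str) -> bool {
    cols.contains(name)
}
```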
I see, but if we really care about the performance here, we definitely need a metadata field for tables. Even a linear scan is pretty dumb for such simple tasks.
I don't see how metadata is the savior here; I think it would require explicit, pre-verified `Value::Table` and `PipelineData::TableStream` variants to have guaranteed info about the valid columns (as well as the column order, since having to account for legal reordering is also a perf sink).
I don't have a clear view of the implementation yet; what I meant is something like a mandatory type signature associated with the table/list. Getting the columns would then be an O(1) op.
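A purely hypothetical Rust sketch of that idea (the type and field names are made up, not Nushell's actual value types): if a table value carried a pre-verified column list, retrieving the columns would no longer require scanning any rows.

```rust
// Hypothetical table value with a mandatory, pre-verified schema.
struct TableWithSchema {
    columns: Vec<String>,   // validated once, stored in order
    rows: Vec<Vec<String>>, // every row is guaranteed to match `columns`
}

impl TableWithSchema {
    // O(1) in the number of rows: just hand back the stored column list.
    fn columns(&self) -> &[String] {
        &self.columns
    }
}
```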