
perf: better scalability of get_columns

Open blindFS opened this issue 7 months ago • 3 comments

Description

Use a HashSet for existence checking. A vector is still needed to keep the column order for tables.
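Roughly, the shape of the change (a simplified standalone sketch, not the exact code in this PR):

```rust
use std::collections::HashSet;

// Simplified sketch: collect the union of record keys across all rows.
// The HashSet answers "seen this column already?" in O(1), while the Vec
// preserves the first-seen column order that tables need.
fn get_columns(rows: &[Vec<(String, String)>]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut columns: Vec<String> = Vec::new();
    for row in rows {
        for (key, _value) in row {
            // `insert` returns false when the key was already present,
            // so each column name is pushed exactly once.
            if seen.insert(key.as_str()) {
                columns.push(key.clone());
            }
        }
    }
    columns
}
```

Before, the existence check was a linear scan over the already-collected columns, which degrades quadratically with the column count.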

User-Facing Changes

Should be None

blindFS avatar May 18 '25 08:05 blindFS

A perf tag without a benchmark or scaling demo is wishful thinking ;) Where is the point at which the overhead of the hash set is paid off by the faster access?

sholderbach avatar May 18 '25 10:05 sholderbach

@blindFS Is it possible to augment our benchmarks to demonstrate this speed improvement?

fdncred avatar Jun 13 '25 21:06 fdncred

I won't call it a benchmark, but here's a simple demo:

mut ans = {}
for i in ..100 {$ans = $ans | merge {$i: a}}
let a = $ans
timeit {..999999 | each {$a} | columns }

The last timeit command improves from ~1min 3sec to ~16.5sec when comparing unoptimized debug builds, and from ~11sec to ~1.2sec with optimized ones.

Obviously, even with this patch this functionality still needs some serious rework, like some kind of metadata fields. But I'm afraid that eventually falls into the whole table vs list<record> discussion...

blindFS avatar Jun 16 '25 09:06 blindFS

Thanks @blindFS. Let's give it a chance.

fdncred avatar Jun 20 '25 10:06 fdncred

Thanks for the benchmark example! Those are definitely substantial improvements for large data. I was playing around to see whether, for more "realistic" numbers of columns, the in-theory cheaper linear ops still matter: I tried low column counts (2, 5, 10, 20) but still high row counts (10^4, 10^5, 10^6) and couldn't see a relevant benefit to special-casing low Ns with the basic timeit measurement.

sholderbach avatar Jun 20 '25 13:06 sholderbach

Sure, it doesn't help much for smaller column numbers. With R rows and C columns, the old linear membership check makes this O(R·C²) overall, while the set brings it down to O(R·C·log C), so the gain only shows up as C grows.

blindFS avatar Jun 20 '25 14:06 blindFS

I wanted to make sure we don't dramatically regress due to the in-practice longer time it takes to set up a HashSet, and its small constant per-lookup overhead compared to an immediate hit in a short linear search. (We still use Vecs for Record for the reason that on most data they are pretty dang fast, even if the complexity in the limit is horrible.)
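For intuition, the kind of comparison I have in mind (a toy standalone sketch, not part of our benchmark suite; which side wins at small column counts depends on the machine and the string lengths):

```rust
use std::collections::HashSet;
use std::time::Instant;

fn main() {
    // Toy setup: a "table" with only a handful of columns.
    let cols: Vec<String> = (0..5).map(|i| format!("col{i}")).collect();
    let lookups = 1_000_000;

    // Linear scan: O(C) per lookup, but C is tiny, there is no setup cost,
    // and a hit can short-circuit after the first comparison.
    let start = Instant::now();
    let mut hits = 0u64;
    for i in 0..lookups {
        let needle = format!("col{}", i % 5);
        if cols.iter().any(|c| *c == needle) {
            hits += 1;
        }
    }
    println!("vec scan: {:?} ({hits} hits)", start.elapsed());

    // Hash set: O(1) expected per lookup, but it has to be built first and
    // every lookup pays the hashing cost.
    let start = Instant::now();
    let set: HashSet<&str> = cols.iter().map(|s| s.as_str()).collect();
    let mut hits = 0u64;
    for i in 0..lookups {
        let needle = format!("col{}", i % 5);
        if set.contains(needle.as_str()) {
            hits += 1;
        }
    }
    println!("hash set: {:?} ({hits} hits)", start.elapsed());
}
```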

sholderbach avatar Jun 20 '25 14:06 sholderbach

In a heterogeneous table scenario (with different keys across the rows), the set would also grow with the number of rows, even though it should saturate pretty quickly for most real-world data.

sholderbach avatar Jun 20 '25 14:06 sholderbach

> I wanted to make sure we don't dramatically regress due to the in-practice longer time it takes to set up a HashSet, and its small constant per-lookup overhead compared to an immediate hit in a short linear search. (We still use Vecs for Record for the reason that on most data they are pretty dang fast, even if the complexity in the limit is horrible.)

I see, but if we really care about the performance here, we definitely need a metadata field for tables. Even a linear scan is pretty dumb for such simple tasks.

blindFS avatar Jun 20 '25 14:06 blindFS

I don't see how metadata is the savior here; I think it would require explicit, pre-verified Value::Table and PipelineData::TableStream to have guaranteed info about the valid columns (as well as the column order, since accounting for legal reordering is also a perf sink).

sholderbach avatar Jun 20 '25 15:06 sholderbach

> I don't see how metadata is the savior here; I think it would require explicit, pre-verified Value::Table and PipelineData::TableStream to have guaranteed info about the valid columns (as well as the column order, since accounting for legal reordering is also a perf sink).

I don't have a clear view of the implementation yet; what I meant is something like a mandatory type signature associated with the table/list. Getting the columns is then an O(1) op.
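Very roughly, something along these lines (a hypothetical sketch with made-up types, not existing nushell code):

```rust
// Hypothetical sketch, not an actual nushell type: a table value that carries
// its column signature once, validated when the table is constructed.
struct TypedTable {
    /// Column names in display order, fixed at construction time.
    columns: Vec<String>,
    /// Each row stores only cell values, in `columns` order.
    rows: Vec<Vec<String>>,
}

impl TypedTable {
    /// O(1): the schema lives next to the data, so no per-row scanning is needed.
    fn get_columns(&self) -> &[String] {
        &self.columns
    }
}

fn main() {
    let table = TypedTable {
        columns: vec!["name".to_string(), "size".to_string()],
        rows: vec![
            vec!["foo.txt".to_string(), "12".to_string()],
            vec!["bar.txt".to_string(), "34".to_string()],
        ],
    };
    println!("{:?}", table.get_columns()); // ["name", "size"]
}
```

The hard part, as you say, is guaranteeing that every row actually matches the signature, which is exactly the pre-verified Value::Table / TableStream question.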

blindFS avatar Jun 20 '25 15:06 blindFS