
perf: better scalability of get_columns

Open blindFS opened this issue 7 months ago • 3 comments

Description

Use a HashSet for existence checking. A vector is still needed to keep the column order for tables.
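Roughly, the shape of the change (a simplified standalone sketch, not the exact code in this PR):

```rust
use std::collections::HashSet;

// Simplified sketch: collect the union of record keys across all rows.
// The HashSet answers "seen this column already?" in O(1), while the Vec
// preserves the first-seen column order that tables need.
fn get_columns(rows: &[Vec<(String, String)>]) -> Vec<String> {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut columns: Vec<String> = Vec::new();
    for row in rows {
        for (key, _value) in row {
            // `insert` returns false when the key was already present,
            // so each column name is pushed exactly once.
            if seen.insert(key.as_str()) {
                columns.push(key.clone());
            }
        }
    }
    columns
}
```

Before, the existence check was a linear scan over the already-collected columns, which degrades quadratically with the column count.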

User-Facing Changes

Should be None

blindFS avatar May 18 '25 08:05 blindFS

A perf tag without a benchmark or scaling demo is wishful thinking ;) Where is the point at which the overhead of the hash set is paid off by the faster access?

sholderbach avatar May 18 '25 10:05 sholderbach

@blindFS Is it possible to augment our benchmarks to demonstrate this speed improvement?

fdncred avatar Jun 13 '25 21:06 fdncred

I won't call it a benchmark, but here's a simple demo:

mut ans = {}
for i in ..100 {$ans = $ans | merge {$i: a}}
let a = $ans
timeit {..999999 | each {$a} | columns }

The last timeit command improves from ~1min 3sec to ~16.5sec when comparing unoptimized debug builds, and from ~11sec to ~1.2sec with optimized ones.

Obviously, even with this patch this functionality still needs some serious rework, like some kind of metadata fields. But I'm afraid that eventually falls into the whole table vs list<record> discussion...

blindFS avatar Jun 16 '25 09:06 blindFS

Thanks @blindFS. Let's give it a chance.

fdncred avatar Jun 20 '25 10:06 fdncred

Thanks for the benchmark example! Those are definitely substantial improvements for large data. I was playing around to see whether, for more "realistic" numbers of columns, the in-theory cheaper linear ops still matter: I tried low column counts (2, 5, 10, 20) but still high row counts (10^4, 10^5, 10^6) and couldn't see a relevant benefit to special-casing low Ns with the basic timeit measurement.

sholderbach avatar Jun 20 '25 13:06 sholderbach

Sure, it doesn't help much for smaller column numbers. With R rows and C columns, the old linear membership check makes this O(R·C²) overall, while the set brings it down to O(R·C·log C), so the gain only shows up as C grows.

blindFS avatar Jun 20 '25 14:06 blindFS

I wanted to make sure we don't dramatically regress due to the in-practice longer time it takes to set up a HashSet, and its small constant per-lookup overhead compared to an immediate hit in a short linear search. (We still use Vecs for Record for the reason that on most data they are pretty dang fast, even if the complexity in the limit is horrible.)
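For intuition, the kind of comparison I have in mind (a toy standalone sketch, not part of our benchmark suite; which side wins at small column counts depends on the machine and the string lengths):

```rust
use std::collections::HashSet;
use std::time::Instant;

fn main() {
    // Toy setup: a "table" with only a handful of columns.
    let cols: Vec<String> = (0..5).map(|i| format!("col{i}")).collect();
    let lookups = 1_000_000;

    // Linear scan: O(C) per lookup, but C is tiny, there is no setup cost,
    // and a hit can short-circuit after the first comparison.
    let start = Instant::now();
    let mut hits = 0u64;
    for i in 0..lookups {
        let needle = format!("col{}", i % 5);
        if cols.iter().any(|c| *c == needle) {
            hits += 1;
        }
    }
    println!("vec scan: {:?} ({hits} hits)", start.elapsed());

    // Hash set: O(1) expected per lookup, but it has to be built first and
    // every lookup pays the hashing cost.
    let start = Instant::now();
    let set: HashSet<&str> = cols.iter().map(|s| s.as_str()).collect();
    let mut hits = 0u64;
    for i in 0..lookups {
        let needle = format!("col{}", i % 5);
        if set.contains(needle.as_str()) {
            hits += 1;
        }
    }
    println!("hash set: {:?} ({hits} hits)", start.elapsed());
}
```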

sholderbach avatar Jun 20 '25 14:06 sholderbach

In a heterogeneous table scenario (with different keys across the rows), the set would also grow with the number of rows, even though it should saturate pretty quickly for most real-world data.

sholderbach avatar Jun 20 '25 14:06 sholderbach

> I wanted to make sure we don't dramatically regress due to the in-practice longer time it takes to set up a HashSet, and its small constant per-lookup overhead compared to an immediate hit in a short linear search. (We still use Vecs for Record for the reason that on most data they are pretty dang fast, even if the complexity in the limit is horrible.)

I see, but if we really care about the performance here, we definitely need a metadata field for tables. Even a linear scan is pretty dumb for such simple tasks.

blindFS avatar Jun 20 '25 14:06 blindFS

I don't see how metadata is the savior here; I think it would require explicit, pre-verified Value::Table and PipelineData::TableStream to have guaranteed info about the valid columns (as well as the column order, since accounting for legal reordering is also a perf sink).

sholderbach avatar Jun 20 '25 15:06 sholderbach

> I don't see how metadata is the savior here; I think it would require explicit, pre-verified Value::Table and PipelineData::TableStream to have guaranteed info about the valid columns (as well as the column order, since accounting for legal reordering is also a perf sink).

I don't have a clear view of the implementation yet; what I meant is something like a mandatory type signature associated with the table/list. Getting the columns is then an O(1) op.
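Very roughly, something along these lines (a hypothetical sketch with made-up types, not existing nushell code):

```rust
// Hypothetical sketch, not an actual nushell type: a table value that carries
// its column signature once, validated when the table is constructed.
struct TypedTable {
    /// Column names in display order, fixed at construction time.
    columns: Vec<String>,
    /// Each row stores only cell values, in `columns` order.
    rows: Vec<Vec<String>>,
}

impl TypedTable {
    /// O(1): the schema lives next to the data, so no per-row scanning is needed.
    fn get_columns(&self) -> &[String] {
        &self.columns
    }
}

fn main() {
    let table = TypedTable {
        columns: vec!["name".to_string(), "size".to_string()],
        rows: vec![
            vec!["foo.txt".to_string(), "12".to_string()],
            vec!["bar.txt".to_string(), "34".to_string()],
        ],
    };
    println!("{:?}", table.get_columns()); // ["name", "size"]
}
```

The hard part, as you say, is guaranteeing that every row actually matches the signature, which is exactly the pre-verified Value::Table / TableStream question.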

blindFS avatar Jun 20 '25 15:06 blindFS