Collapse `generate_series` when output columns are projected away
This PR updates projection pushdown for FlatMap nodes to notice example cases where output columns are unused, and the number of records can be determined from the data. This is essentially only in support of optimizing generate_series, which has historically forced complete unpacking for something like
select count(*) from generate_series(1, 1000000);
This now optimizes to the constant 1000000 rather than a dataflow that deploys to build one million rows and count them.
Motivation
Tips for reviewer
Checklist
- [ ] This PR has adequate test coverage / QA involvement has been duly considered.
- [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
- [ ] If this PR evolves an existing
$T ⇔ Proto$Tmapping (possibly in a backwards-incompatible way), then it is tagged with aT-protolabel. - [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
- [ ] This PR includes the following user-facing behavior changes:
Explaining queries containing generate_series go OoM now, but worked before:
explain select count(*) from generate_series(0, 999) AS x, generate_series(0, 999) AS y, generate_series(0, 999) AS z;
(Similar issue for csv_extract: https://github.com/MaterializeInc/database-issues/issues/6871)
The first issue is fixed, sorry! The second issue goes a bit deeper, and while I can prevent the OOM, I cannot yet prevent about 45s of "counting up to one billion" on some core. At the heart of it, the AggregateFunc::eval methods require iterators over Datum, which means we need to provide one billion records. That can be done memory-efficiently with an iterator, but seemingly not compute-efficiently, that I have found. I filed https://github.com/MaterializeInc/database-issues/issues/6932 as a follow-up issue, but perhaps the right thing to do is get the changes in that turn this into a CPU issue rather than memory issue, and not land the PR.
One plausible thing to do is revert the relaxation to fold_reduce_constant that admits constant relations that are small in binary but large in unary, which results in select count(*) from generate_series(1, 1000000) turning in to a non-constant plan that must be shipped to a compute cluster, but at least only takes a mere moment while it uses its binary count implementation.
Independently, the changes to the rest of fold_reduce_constant merit some thought, as it replaces some unpleasant behavior with what I think is a more robust implementation.