materialize icon indicating copy to clipboard operation
materialize copied to clipboard

Collapse `generate_series` when output columns are projected away

Open frankmcsherry opened this issue 2 years ago • 3 comments

This PR updates projection pushdown for FlatMap nodes to notice example cases where output columns are unused, and the number of records can be determined from the data. This is essentially only in support of optimizing generate_series, which has historically forced complete unpacking for something like

select count(*) from generate_series(1, 1000000);

This now optimizes to the constant 1000000 rather than a dataflow that deploys to build one million rows and count them.

Motivation

Tips for reviewer

Checklist

  • [ ] This PR has adequate test coverage / QA involvement has been duly considered.
  • [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • [ ] If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • [ ] This PR includes the following user-facing behavior changes:

frankmcsherry avatar Oct 27 '23 19:10 frankmcsherry

Explaining queries containing generate_series go OoM now, but worked before:

explain select count(*) from generate_series(0, 999) AS x, generate_series(0, 999) AS y, generate_series(0, 999) AS z;

(Similar issue for csv_extract: https://github.com/MaterializeInc/database-issues/issues/6871)

def- avatar Oct 27 '23 22:10 def-

The first issue is fixed, sorry! The second issue goes a bit deeper, and while I can prevent the OOM, I cannot yet prevent about 45s of "counting up to one billion" on some core. At the heart of it, the AggregateFunc::eval methods require iterators over Datum, which means we need to provide one billion records. That can be done memory-efficiently with an iterator, but seemingly not compute-efficiently, that I have found. I filed https://github.com/MaterializeInc/database-issues/issues/6932 as a follow-up issue, but perhaps the right thing to do is get the changes in that turn this into a CPU issue rather than memory issue, and not land the PR.

frankmcsherry avatar Nov 07 '23 21:11 frankmcsherry

One plausible thing to do is revert the relaxation to fold_reduce_constant that admits constant relations that are small in binary but large in unary, which results in select count(*) from generate_series(1, 1000000) turning in to a non-constant plan that must be shipped to a compute cluster, but at least only takes a mere moment while it uses its binary count implementation.

Independently, the changes to the rest of fold_reduce_constant merit some thought, as it replaces some unpleasant behavior with what I think is a more robust implementation.

frankmcsherry avatar Nov 07 '23 21:11 frankmcsherry