risingwave icon indicating copy to clipboard operation
risingwave copied to clipboard

expr: avoid repeating the same scalar into an array

Open BugenZhao opened this issue 1 year ago • 9 comments

For example, there's an EXTRACT(HOUR FROM col) in Nexmark Q14, where the HOUR is compiled to a literal VARCHAR expression. When evaluating the EXTRACT, we need to first repeat the same scalar "HOUR" 1024 times into an array, then evaluate the outer EXTRACT function. This is not efficient. https://github.com/risingwavelabs/risingwave/issues/8503#issuecomment-1465808412

Possible solutions:

  • Introduce ConstantArray than only stores the scalar and the time it appears, which is essentially a special case of Run-Length Encoding (arrow-array). This seems hard to do this under current architecture as we always use static type for arrays, so introducing a wrapper requires a lot of changes.

  • Check whether an argument of the expression is constant (literal) during the build_from_proto with macro (introduced in #8499). In this case, we're not able to handle the structure where a literal is nested under another expression, though in most cases this should be folded by the optimizer.

  • Allow the expression to directly return a scalar, and expands it into an array by repeating only if necessary. This sounds much simpler and the refactoring can be progressive.

BugenZhao avatar Apr 07 '23 12:04 BugenZhao

This seems hard to do this under current architecture as we always use static type for arrays, so introducing a wrapper requires a lot of changes.

Can you elaborate this? How is "static type" a problem and how dynamic is ConstantArray/RunArrary?

xxchan avatar Apr 07 '23 13:04 xxchan

This seems hard to do this under current architecture as we always use static type for arrays, so introducing a wrapper requires a lot of changes.

Can you elaborate this? How is "static type" a problem and how dynamic is ConstantArray/RunArrary?

For example, arrays in arrow and arrow2 are all trait objects, so it can introduce a RunArray wrapper easily without exposing it to any callers. However in our type system, we need to write a lot of stuff like MaybeRun<Utf8Array> or MaybeRun<ArrayImpl>. 🤔

BugenZhao avatar Jun 12 '23 06:06 BugenZhao

How is the situation now after #9049?

xxchan avatar Jun 20 '23 07:06 xxchan

How is the situation now after #9049?

I guess the ultimate solution should be allowing Value::Scalar to directly be passed among different executors and even remote actors, as described in https://github.com/risingwavelabs/risingwave/pull/9733#issuecomment-1543658669. But yes, It appears that #9049 has accomplished everything we can do without introducing a significant refactor. 😄

BugenZhao avatar Jun 20 '23 09:06 BugenZhao

FWIW this looks similar 👀 https://github.com/apache/arrow-rs/issues/1047

xxchan avatar Jul 03 '23 12:07 xxchan

Wonder if we can further generalize this into some compact encoding for multiple repeated datums. It could potentially optimize join performance, since the datums in the join key don't need to be expanded inline.

kwannoel avatar May 13 '24 08:05 kwannoel

Specifically for high amplification join, when building the new chunk, the probe side's record, just needs to convert its scalar values into constant array, then we can just concat that with the build side to form the new stream chunk.

kwannoel avatar May 20 '24 03:05 kwannoel

Wonder if we can further generalize this into some compact encoding for multiple repeated datums.

Yes. Are you referring to...

BugenZhao avatar May 20 '24 08:05 BugenZhao

Just FYI: eval_v2, introduced in #9049, is not adopted by all proc-macro-generated function impl any more.

BugenZhao avatar Aug 28 '24 06:08 BugenZhao