velox icon indicating copy to clipboard operation
velox copied to clipboard

Support complex type 'compare' in 3-arg min_by and max_by Presto aggregates

Open f0rest9999 opened this issue 11 months ago • 14 comments

Presto's min_by and max_by with 3 args support complex type for the compare param, such as array.

Fixes #7469

f0rest9999 avatar Mar 05 '24 09:03 f0rest9999

Deploy Preview for meta-velox canceled.

Name Link
Latest commit 6d56823d9c0a33683e41788a5d835b684fc13871
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/6603d1e3e20d05000846d100

netlify[bot] avatar Mar 05 '24 09:03 netlify[bot]

@mbasmanova Could you please review this PR?

f0rest9999 avatar Mar 06 '24 01:03 f0rest9999

Could someone please review this PR? Thanks! @mbasmanova @kagamiori @kgpai

f0rest9999 avatar Mar 12 '24 01:03 f0rest9999

Thank you for this @f0rest9999 ! This PR will take me a bit of time, please give me a couple of days to review.

kgpai avatar Mar 12 '24 04:03 kgpai

Thanks for the review, I'll take the time to think about and fix.

f0rest9999 avatar Mar 15 '24 02:03 f0rest9999

Throw an exception when the key col which participate in comparison contains null.

This is a bit tricky. See kNullAsIndeterminate in CompareFlags.h and #7703

mbasmanova avatar Mar 21 '24 10:03 mbasmanova

nit: This class is very similar to MinMaxByNComplexTypeAccumulator save only difference is that here the compare is comples and there the value is complex, I am curious if there is a way to merge both into 1 (without adding too much templating so its hard to follow).

This comments will be done after our discussion. Other comments I have replied or fixed.Please help review this @kgpai CC @mbasmanova, thanks all!

f0rest9999 avatar Mar 25 '24 05:03 f0rest9999

@mbasmanova Already fixed.

f0rest9999 avatar Mar 26 '24 05:03 f0rest9999

That's a good suggestion. Could you open an issue and see if anyone is interested? While I probably haven't had a lot of time to look into how to implement this recently, and I prefer to get this non-simple Function API PR done first to support complex types. Is that all right?

And I'll spend some time looking at how to test "Aggregation Fuzzer using Presto". Thanks.

f0rest9999 avatar Mar 26 '24 09:03 f0rest9999

Have you run Aggregation Fuzzer using Presto as source before? In the process of running, I found some problems. @mbasmanova @kgpai

And when I run max_by, I got this error, while I think MAP<BOOLEAN,BOOLEAN> should not be passed into the function as a compare key, how can I limit it?

./velox_aggregation_fuzzer_test \
                --seed ${RANDOM} \
                --only "max_by" \
                --duration_sec 3600 \
                --logtostderr=1 \
                --minloglevel=0 \
                --repro_persist_path=/tmp/aggregate_fuzzer_repro \
                --enable_sorted_aggregations=true \
                --presto_url=http://127.0.0.1:8080

Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Unknown input types for max_by (SINGLE) aggregation: SMALLINT, MAP<BOOLEAN,BOOLEAN>, BIGINT
Retriable: False
Function: createNArg
File:MinMaxByAggregates.cpp
Line: 1755

At present, I don't know much about the data-generate process of Fuzzer, seems like a general problem?

f0rest9999 avatar Mar 27 '24 06:03 f0rest9999

@f0rest9999

That's a good suggestion. Could you open an issue and see if anyone is interested? While I probably haven't had a lot of time to look into how to implement this recently, and I prefer to get this non-simple Function API PR done first to support complex types. Is that all right?

I'm afraid we won't be able to merge this code as is. There is too much copy-paste. The code is too hard to read and reason about. We'll need to simply. I believe the first step would be to convert existing implementation to use Simple Function API. The next step would be to add new logic to it and figure out how to avoid so much copy-paste when doing that. Thanks.

mbasmanova avatar Mar 27 '24 08:03 mbasmanova

I'm afraid we won't be able to merge this code as is. There is too much copy-paste. The code is too hard to read and reason about. We'll need to simply. I believe the first step would be to convert existing implementation to use Simple Function API. The next step would be to add new logic to it and figure out how to avoid so much copy-paste when doing that. Thanks.

@mbasmanova Thanks for your all review and suggestion, also @kgpai . The issue can cc@ me. Rewriting to the simple API will take some time. So it probably not be done quickly by me. During the time, if someone else has implemented it, I will learn how to do it.In short, I‘m interested in this issue.

I think the reason it's hard to scale is that templates are used in a lot of places, and the function takes two arguments (various types). The combination of these two reasons made it difficult to extend and reuse code under the old framework.

You can close this PR. Thanks!

f0rest9999 avatar Mar 27 '24 08:03 f0rest9999

Have you run Aggregation Fuzzer using Presto as source before? In the process of running, I found some problems. @mbasmanova @kgpai

And when I run max_by, I got this error, while I think MAP<BOOLEAN,BOOLEAN> should not be passed into the function as a compare key, how can I limit it?

./velox_aggregation_fuzzer_test \
                --seed ${RANDOM} \
                --only "max_by" \
                --duration_sec 3600 \
                --logtostderr=1 \
                --minloglevel=0 \
                --repro_persist_path=/tmp/aggregate_fuzzer_repro \
                --enable_sorted_aggregations=true \
                --presto_url=http://127.0.0.1:8080

Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Unknown input types for max_by (SINGLE) aggregation: SMALLINT, MAP<BOOLEAN,BOOLEAN>, BIGINT
Retriable: False
Function: createNArg
File:MinMaxByAggregates.cpp
Line: 1755

At present, I don't know much about the data-generate process of Fuzzer, seems like a general problem?

Hi @f0rest9999 this might be related to the your PR change as current min_by, max_by doesnt support complex types for comparison. Nevertheless if you can create an issue and add the stack trace I can investigate more.

kgpai avatar Mar 27 '24 16:03 kgpai

Hi @f0rest9999 this might be related to the your PR change as current min_by, max_by doesnt support complex types for comparison. Nevertheless if you can create an issue and add the stack trace I can investigate more.

You may have missed this comment. Once you address it, the error should go away.

After the modification according to mbasmanova's suggestion, there is no such problem. Thanks for your attention.

f0rest9999 avatar Mar 28 '24 02:03 f0rest9999