velox
velox copied to clipboard
Support complex type 'compare' in 3-arg min_by and max_by Presto aggregates
Presto's min_by and max_by with 3 args support complex type for the compare param, such as array.
Fixes #7469
Deploy Preview for meta-velox canceled.
Name | Link |
---|---|
Latest commit | 6d56823d9c0a33683e41788a5d835b684fc13871 |
Latest deploy log | https://app.netlify.com/sites/meta-velox/deploys/6603d1e3e20d05000846d100 |
@mbasmanova Could you please review this PR?
Could someone please review this PR? Thanks! @mbasmanova @kagamiori @kgpai
Thank you for this @f0rest9999 ! This PR will take me a bit of time, please give me a couple of days to review.
Thanks for the review, I'll take the time to think about and fix.
Throw an exception when the key col which participate in comparison contains null.
This is a bit tricky. See kNullAsIndeterminate in CompareFlags.h and #7703
nit: This class is very similar to MinMaxByNComplexTypeAccumulator save only difference is that here the compare is comples and there the value is complex, I am curious if there is a way to merge both into 1 (without adding too much templating so its hard to follow).
This comments will be done after our discussion. Other comments I have replied or fixed.Please help review this @kgpai CC @mbasmanova, thanks all!
@mbasmanova Already fixed.
That's a good suggestion. Could you open an issue and see if anyone is interested? While I probably haven't had a lot of time to look into how to implement this recently, and I prefer to get this non-simple Function API PR done first to support complex types. Is that all right?
And I'll spend some time looking at how to test "Aggregation Fuzzer using Presto". Thanks.
Have you run Aggregation Fuzzer using Presto as source before? In the process of running, I found some problems. @mbasmanova @kgpai
And when I run max_by, I got this error, while I think MAP<BOOLEAN,BOOLEAN> should not be passed into the function as a compare key, how can I limit it?
./velox_aggregation_fuzzer_test \
--seed ${RANDOM} \
--only "max_by" \
--duration_sec 3600 \
--logtostderr=1 \
--minloglevel=0 \
--repro_persist_path=/tmp/aggregate_fuzzer_repro \
--enable_sorted_aggregations=true \
--presto_url=http://127.0.0.1:8080
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Unknown input types for max_by (SINGLE) aggregation: SMALLINT, MAP<BOOLEAN,BOOLEAN>, BIGINT
Retriable: False
Function: createNArg
File:MinMaxByAggregates.cpp
Line: 1755
At present, I don't know much about the data-generate process of Fuzzer, seems like a general problem?
@f0rest9999
That's a good suggestion. Could you open an issue and see if anyone is interested? While I probably haven't had a lot of time to look into how to implement this recently, and I prefer to get this non-simple Function API PR done first to support complex types. Is that all right?
I'm afraid we won't be able to merge this code as is. There is too much copy-paste. The code is too hard to read and reason about. We'll need to simply. I believe the first step would be to convert existing implementation to use Simple Function API. The next step would be to add new logic to it and figure out how to avoid so much copy-paste when doing that. Thanks.
I'm afraid we won't be able to merge this code as is. There is too much copy-paste. The code is too hard to read and reason about. We'll need to simply. I believe the first step would be to convert existing implementation to use Simple Function API. The next step would be to add new logic to it and figure out how to avoid so much copy-paste when doing that. Thanks.
@mbasmanova Thanks for your all review and suggestion, also @kgpai . The issue can cc@ me. Rewriting to the simple API will take some time. So it probably not be done quickly by me. During the time, if someone else has implemented it, I will learn how to do it.In short, I‘m interested in this issue.
I think the reason it's hard to scale is that templates are used in a lot of places, and the function takes two arguments (various types). The combination of these two reasons made it difficult to extend and reuse code under the old framework.
You can close this PR. Thanks!
Have you run Aggregation Fuzzer using Presto as source before? In the process of running, I found some problems. @mbasmanova @kgpai
And when I run max_by, I got this error, while I think MAP<BOOLEAN,BOOLEAN> should not be passed into the function as a compare key, how can I limit it?
./velox_aggregation_fuzzer_test \ --seed ${RANDOM} \ --only "max_by" \ --duration_sec 3600 \ --logtostderr=1 \ --minloglevel=0 \ --repro_persist_path=/tmp/aggregate_fuzzer_repro \ --enable_sorted_aggregations=true \ --presto_url=http://127.0.0.1:8080 Error Source: RUNTIME Error Code: INVALID_STATE Reason: Unknown input types for max_by (SINGLE) aggregation: SMALLINT, MAP<BOOLEAN,BOOLEAN>, BIGINT Retriable: False Function: createNArg File:MinMaxByAggregates.cpp Line: 1755
At present, I don't know much about the data-generate process of Fuzzer, seems like a general problem?
Hi @f0rest9999 this might be related to the your PR change as current min_by, max_by doesnt support complex types for comparison. Nevertheless if you can create an issue and add the stack trace I can investigate more.
Hi @f0rest9999 this might be related to the your PR change as current min_by, max_by doesnt support complex types for comparison. Nevertheless if you can create an issue and add the stack trace I can investigate more.
You may have missed this comment. Once you address it, the error should go away.
After the modification according to mbasmanova's suggestion, there is no such problem. Thanks for your attention.