pinot
pinot copied to clipboard
[Feature] Support for Correlation Function
Add support for calculating Pearson's coefficient corr(x,y)
as part of the effort in #8493. Can leverage implementation for COVAR_SAMP and COVAR_POP as mentioned in #9236.
I'm new to the community and would love to start contributing with this task.
Can i work on the task?
Yes definitely. We can help you
@jasperjiaguo / @SabrinaZhaozyf - can you please share any information, previous PRs etc that can help @VenkatDatta get started.
Hi @VenkatDatta, thank you for taking this up! Hopefully, the following information can help you get started:)
Definition https://en.wikipedia.org/wiki/Correlation. In Pinot, correlation can be used to describe the dependence/association of two columns.
Support in Existing DBs
Presto: https://prestodb.io/docs/current/functions/aggregate.html#statistical-aggregate-functions
Postgres: https://www.postgresql.org/docs/9.4/functions-aggregate.html
Pinot should follow the same syntax: corr(x, y)
-> DOUBLE
Calculation
Formula for correlation can be found in https://en.wikipedia.org/wiki/Correlation. You could also think of it as a normalized covariance. corr(x, y) = cov(x, y) / (std(x) * std(y))
Related PRs
- Background: #8493
- Covariance: #9236
- Histogram: #8724
Good Starting Point
- Run/step through unit tests for aggregation functions under
src/test/java/org/apache/pinot/queries
to familiarize with how aggregation happens distributedly- See how aggregate and merge update the states we are keeping track and how are they different?
- Helpful breakpoints
org/apache/pinot/core/query/reduce/AggregationDataTableReducer.java
org/apache/pinot/core/operator/combine/BaseCombineOperator.java
org/apache/pinot/core/operator/query/AggregationOperator.java
- Extend
CovarianceTuple
to add fields(sum of squares?) that help calculate the standard deviation for the 2 columns - Also helpful to look at other classes in
org/apache/pinot/segment/local/customobject
- Also good to look at other aggregation functions under
org/apache/pinot/core/query/aggregation/function
and see how they are wired in
Testing
- Should have very similar structure as
CovarianceQueriesTest.java
- Want to test on both individual segments and inter-segments to make sure both intermediate results and reduced results are correct
- Make sure to test on distinct servers
- Please feel free to ask me any questions! This part can be a bit tricky.
Please let me / @jasperjiaguo know if you have any questions!
Got it, i will go through the steps mentioned.
Thanks @SabrinaZhaozyf , @siddharthteotia for sharing the resources :)
working on this one, should have a PR in a few days with tests.
Hi @subkanthi - just curious if you are planning to put out a PR for this.
Hi @subkanthi - just curious if you are planning to put out a PR for this.
Hi @siddharthteotia , should have the PR up in a day, finishing the last test.
@siddharthteotia while testing, noticed that with the following implementation get a lot of NaN values, but with apache commons math PearsonCorrelation class its not NaN, just like the skewness
and kurtosis
implementation trying to change to use the apache commons function.
https://github.com/subkanthi/pinot/pull/1
double sumX = correlationTuple.getSumX();
double sumY = correlationTuple.getSumY();
double sumXY = correlationTuple.getSumXY();
double squareSumX = correlationTuple.getSquareSumX();
double squareSumY = correlationTuple.getSquareSumY();
double bottom = Math.sqrt((count * squareSumX
- sumX * sumX) * (count * squareSumY - sumY * sumY));
if (bottom == 0) {
return 0d;
}
double top = count * sumXY - sumX * sumY;
return top / bottom;
}
May be you are running into divide by 0 problem ? Did you try step into the code to understand where it is turning into NaN ?
cc @jasperjiaguo / @SabrinaZhaozyf
NaN could be due to Math.sqrt(negative_number) or 0.0/0.0 We have recently discovered this by-definition impl of covariance/correlation has numerical stability issue when E[x^2] ~ E[x]^2 >> 0 (see 1 2 3). Could we also use similar implementations? I'm not sure if the online algorithm is already available as a library, but feel free to use if apache common has it.
Thanks @jasperjiaguo , will compare the Trino implementation, Apache commons does expose PearsonCorrelation. will update the PR in a day or two.