pinot icon indicating copy to clipboard operation
pinot copied to clipboard

[Feature] Support for Correlation Function

Open SabrinaZhaozyf opened this issue 2 years ago • 4 comments

Add support for calculating Pearson's coefficient corr(x,y) as part of the effort in #8493. Can leverage implementation for COVAR_SAMP and COVAR_POP as mentioned in #9236.

SabrinaZhaozyf avatar Aug 24 '22 22:08 SabrinaZhaozyf

I'm new to the community and would love to start contributing with this task.

Can i work on the task?

VenkatDatta avatar Aug 25 '22 12:08 VenkatDatta

Yes definitely. We can help you

@jasperjiaguo / @SabrinaZhaozyf - can you please share any information, previous PRs etc that can help @VenkatDatta get started.

siddharthteotia avatar Aug 25 '22 14:08 siddharthteotia

Hi @VenkatDatta, thank you for taking this up! Hopefully, the following information can help you get started:)

Definition https://en.wikipedia.org/wiki/Correlation. In Pinot, correlation can be used to describe the dependence/association of two columns.

Support in Existing DBs Presto: https://prestodb.io/docs/current/functions/aggregate.html#statistical-aggregate-functions Postgres: https://www.postgresql.org/docs/9.4/functions-aggregate.html Pinot should follow the same syntax: corr(x, y) -> DOUBLE

Calculation Formula for correlation can be found in https://en.wikipedia.org/wiki/Correlation. You could also think of it as a normalized covariance. corr(x, y) = cov(x, y) / (std(x) * std(y))

Related PRs

  • Background: #8493
  • Covariance: #9236
  • Histogram: #8724

Good Starting Point

  • Run/step through unit tests for aggregation functions under src/test/java/org/apache/pinot/queries to familiarize with how aggregation happens distributedly
    • See how aggregate and merge update the states we are keeping track and how are they different?
    • Helpful breakpoints org/apache/pinot/core/query/reduce/AggregationDataTableReducer.java org/apache/pinot/core/operator/combine/BaseCombineOperator.java org/apache/pinot/core/operator/query/AggregationOperator.java
  • Extend CovarianceTuple to add fields(sum of squares?) that help calculate the standard deviation for the 2 columns
  • Also helpful to look at other classes in org/apache/pinot/segment/local/customobject
  • Also good to look at other aggregation functions under org/apache/pinot/core/query/aggregation/function and see how they are wired in

Testing

  • Should have very similar structure as CovarianceQueriesTest.java
  • Want to test on both individual segments and inter-segments to make sure both intermediate results and reduced results are correct
  • Make sure to test on distinct servers
    • Please feel free to ask me any questions! This part can be a bit tricky.

Please let me / @jasperjiaguo know if you have any questions!

SabrinaZhaozyf avatar Aug 25 '22 17:08 SabrinaZhaozyf

Got it, i will go through the steps mentioned.

Thanks @SabrinaZhaozyf , @siddharthteotia for sharing the resources :)

VenkatDatta avatar Aug 25 '22 17:08 VenkatDatta

working on this one, should have a PR in a few days with tests.

subkanthi avatar Dec 21 '22 22:12 subkanthi

Hi @subkanthi - just curious if you are planning to put out a PR for this.

siddharthteotia avatar Jan 14 '23 01:01 siddharthteotia

Hi @subkanthi - just curious if you are planning to put out a PR for this.

Hi @siddharthteotia , should have the PR up in a day, finishing the last test.

subkanthi avatar Jan 14 '23 02:01 subkanthi

@siddharthteotia while testing, noticed that with the following implementation get a lot of NaN values, but with apache commons math PearsonCorrelation class its not NaN, just like the skewness and kurtosis implementation trying to change to use the apache commons function. https://github.com/subkanthi/pinot/pull/1

            double sumX = correlationTuple.getSumX();
            double sumY = correlationTuple.getSumY();
            double sumXY = correlationTuple.getSumXY();
            double squareSumX = correlationTuple.getSquareSumX();
            double squareSumY = correlationTuple.getSquareSumY();

            double bottom = Math.sqrt((count * squareSumX
                    - sumX * sumX) * (count * squareSumY - sumY * sumY));

            if (bottom == 0) {
                return 0d;
            }
            double top = count * sumXY - sumX * sumY;
            return top / bottom;
        }

subkanthi avatar Jan 16 '23 02:01 subkanthi

May be you are running into divide by 0 problem ? Did you try step into the code to understand where it is turning into NaN ?

cc @jasperjiaguo / @SabrinaZhaozyf

siddharthteotia avatar Jan 19 '23 01:01 siddharthteotia

NaN could be due to Math.sqrt(negative_number) or 0.0/0.0 We have recently discovered this by-definition impl of covariance/correlation has numerical stability issue when E[x^2] ~ E[x]^2 >> 0 (see 1 2 3). Could we also use similar implementations? I'm not sure if the online algorithm is already available as a library, but feel free to use if apache common has it.

jasperjiaguo avatar Jan 19 '23 04:01 jasperjiaguo

Thanks @jasperjiaguo , will compare the Trino implementation, Apache commons does expose PearsonCorrelation. will update the PR in a day or two.

subkanthi avatar Jan 23 '23 19:01 subkanthi