
Incorrect Mean calculation

Open rmaheshkumarblr opened this issue 2 years ago • 2 comments

The mean is calculated incorrectly when the column values are very large (for example, epoch timestamps) and the dataset is large as well.

Based on my analysis:

  • In the Mean.scala file, the mean is not calculated with Spark's mean function directly; instead, the sum and the count are calculated separately and then divided.

  • https://github.com/awslabs/deequ/blob/933417676189bc7833166f976fd024a4b2177292/src/main/scala/com/amazon/deequ/analyzers/Mean.scala#L32

  • Spark's sum returns a bigint, so if the sum is very large an overflow occurs and the output is incorrect. Using Spark's mean function instead gives the correct result (see the sketch below).
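
A minimal sketch of the problem described above (object and variable names are just for illustration; it assumes Spark's default non-ANSI mode, where Long addition wraps around silently):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, mean, sum}

// Values are chosen so that the Long sum exceeds Long.MaxValue even though
// each individual value fits comfortably in a Long.
object MeanOverflowRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("mean-overflow-repro")
      .getOrCreate()

    val n = 10L
    val bigValue = Long.MaxValue / (n - 1)  // summing n of these overflows a Long
    val df = spark.range(n).select(lit(bigValue).as("ts"))

    // sum-then-divide: the Long sum is computed first, so the overflow has
    // already happened by the time the result is cast to Double.
    val viaSumCount = df
      .select(sum(col("ts")).cast("double") / count(col("ts")))
      .first()
      .getDouble(0)

    // Spark's mean aggregates in Double, so it does not overflow here.
    val viaMean = df.select(mean(col("ts"))).first().getDouble(0)

    println(s"sum/count = $viaSumCount, mean = $viaMean")
    spark.stop()
  }
}
```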

I don't have the full context behind calculating the sum and the count separately and then computing the mean. Would love to hear more about it.

rmaheshkumarblr commented Jul 15 '22

I think the reason for doing sum then division is to account for previous states when updating the mean; that said, this is indeed a bug, because Double doesn't have the same precision as Long and overflows will be missed.
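
For context, here is a sketch of the state-merging pattern that the sum/count split enables (illustrative names, not deequ's actual classes): keeping sum and count lets two partial states be combined, which a single mean value alone would not allow.

```scala
// Illustrative sketch: storing sum and count lets two partial states be
// merged, e.g. an existing state plus a new batch of data.
final case class MeanState(sum: Double, count: Long) {
  def metricValue: Double = if (count == 0L) Double.NaN else sum / count

  def merge(other: MeanState): MeanState =
    MeanState(sum + other.sum, count + other.count)
}

val existing = MeanState(sum = 1.0e9, count = 1000L)
val newBatch = MeanState(sum = 2.0e9, count = 3000L)
println(existing.merge(newBatch).metricValue)  // 3.0e9 / 4000 = 750000.0
```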

There's a larger problem: all metrics are currently represented as Double, so we'll need to change some of the underlying architecture to support Long metric values as well.

shehzad-qureshi commented Feb 2 '23

Maybe we can go with a simple change? Move from Double -> BigDecimal and from Long -> BigInt.

Is there an idea of how this should be solved? I'm happy to help here (a rough sketch of the suggestion is below).
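
A rough sketch of what that suggestion could look like for the mean state (hypothetical types, not a proposal for deequ's actual API): arbitrary-precision arithmetic avoids both the Long overflow and the Double precision loss, at the cost of slower aggregation.

```scala
// Hypothetical variant of the mean state using arbitrary-precision types,
// following the Double -> BigDecimal and Long -> BigInt suggestion above.
final case class ExactMeanState(sum: BigDecimal, count: BigInt) {
  def metricValue: BigDecimal =
    if (count == 0) BigDecimal(0) else sum / BigDecimal(count)

  def merge(other: ExactMeanState): ExactMeanState =
    ExactMeanState(sum + other.sum, count + other.count)
}

// Summing epoch-millisecond timestamps over billions of rows stays exact here,
// whereas the same sum would overflow a Long.
val state = ExactMeanState(
  sum = BigDecimal("1660000000000") * BigDecimal(6000000000L),
  count = BigInt(6000000000L)
)
println(state.metricValue)  // 1660000000000 (the exact mean)
```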

explicite commented May 24 '23