
Incorrect Mean calculation

Open rmaheshkumarblr opened this issue 2 years ago • 2 comments

The mean is calculated incorrectly when the column values are very large (for example, epoch timestamps) and the dataset is large as well.

Based on my analysis:

  • In the Mean.scala file, the mean is not calculated with Spark's mean function directly; instead, the sum and the count are calculated separately and then divided.

  • https://github.com/awslabs/deequ/blob/933417676189bc7833166f976fd024a4b2177292/src/main/scala/com/amazon/deequ/analyzers/Mean.scala#L32

  • Spark's sum returns a bigint, so if the sum is very large an overflow occurs and the output is incorrect. Using Spark's mean function instead gives the correct result (see the sketch below).
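
A minimal sketch of the problem described above (object and variable names are just for illustration; it assumes Spark's default non-ANSI mode, where Long addition wraps around silently):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, mean, sum}

// Values are chosen so that the Long sum exceeds Long.MaxValue even though
// each individual value fits comfortably in a Long.
object MeanOverflowRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("mean-overflow-repro")
      .getOrCreate()

    val n = 10L
    val bigValue = Long.MaxValue / (n - 1)  // summing n of these overflows a Long
    val df = spark.range(n).select(lit(bigValue).as("ts"))

    // sum-then-divide: the Long sum is computed first, so the overflow has
    // already happened by the time the result is cast to Double.
    val viaSumCount = df
      .select(sum(col("ts")).cast("double") / count(col("ts")))
      .first()
      .getDouble(0)

    // Spark's mean aggregates in Double, so it does not overflow here.
    val viaMean = df.select(mean(col("ts"))).first().getDouble(0)

    println(s"sum/count = $viaSumCount, mean = $viaMean")
    spark.stop()
  }
}
```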

I don't have the full context behind calculating the sum and the count separately and then computing the mean. Would love to hear more about it.

rmaheshkumarblr commented Jul 15 '22

I think the reason for doing sum then division is to account for previous states when updating the mean; that said, this is indeed a bug, because Double doesn't have the same precision as Long and overflows will be missed.
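
For context, here is a sketch of the state-merging pattern that the sum/count split enables (illustrative names, not deequ's actual classes): keeping sum and count lets two partial states be combined, which a single mean value alone would not allow.

```scala
// Illustrative sketch: storing sum and count lets two partial states be
// merged, e.g. an existing state plus a new batch of data.
final case class MeanState(sum: Double, count: Long) {
  def metricValue: Double = if (count == 0L) Double.NaN else sum / count

  def merge(other: MeanState): MeanState =
    MeanState(sum + other.sum, count + other.count)
}

val existing = MeanState(sum = 1.0e9, count = 1000L)
val newBatch = MeanState(sum = 2.0e9, count = 3000L)
println(existing.merge(newBatch).metricValue)  // 3.0e9 / 4000 = 750000.0
```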

There's a larger problem: all metrics are currently represented as Double, so we'll need to change some of the underlying architecture to support Long metric values as well.

shehzad-qureshi commented Feb 2 '23

Maybe we can go with a simple change? Move from Double -> BigDecimal and from Long -> BigInt.

Is there an idea of how this should be solved? I'm happy to help here (a rough sketch of the suggestion is below).
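
A rough sketch of what that suggestion could look like for the mean state (hypothetical types, not a proposal for deequ's actual API): arbitrary-precision arithmetic avoids both the Long overflow and the Double precision loss, at the cost of slower aggregation.

```scala
// Hypothetical variant of the mean state using arbitrary-precision types,
// following the Double -> BigDecimal and Long -> BigInt suggestion above.
final case class ExactMeanState(sum: BigDecimal, count: BigInt) {
  def metricValue: BigDecimal =
    if (count == 0) BigDecimal(0) else sum / BigDecimal(count)

  def merge(other: ExactMeanState): ExactMeanState =
    ExactMeanState(sum + other.sum, count + other.count)
}

// Summing epoch-millisecond timestamps over billions of rows stays exact here,
// whereas the same sum would overflow a Long.
val state = ExactMeanState(
  sum = BigDecimal("1660000000000") * BigDecimal(6000000000L),
  count = BigInt(6000000000L)
)
println(state.metricValue)  // 1660000000000 (the exact mean)
```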

explicite commented May 24 '23