Different sizes of reference dataset and current dataset

Open prarshah opened this issue 3 years ago • 1 comments

@emeli-dral When the reference dataset (usually the historical data) is much larger than the current dataset, should we rescale the frequencies so that we compare distributions of similar size?

ref: https://commons.apache.org/proper/commons-math/javadocs/api-3.3/org/apache/commons/math3/stat/inference/ChiSquareTest.html

prarshah avatar Oct 19 '21 08:10 prarshah

Hi @prarshah,

thank you for bringing this up!

Currently, we do not account for the dataset sizes in this test, so it makes sense to rescale the frequencies.
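As a rough illustration of the rescaling idea for a chi-square test like the one linked above, here is a minimal sketch using SciPy. It is not the library's implementation: the function name `rescaled_chisquare` and the example counts are made up, and it assumes both datasets are already summarized as per-category counts in the same category order.

```python
import numpy as np
from scipy.stats import chisquare


def rescaled_chisquare(ref_counts, cur_counts):
    """Chi-square test with reference counts rescaled to the current total.

    Hypothetical helper for illustration only.
    """
    ref = np.asarray(ref_counts, dtype=float)
    cur = np.asarray(cur_counts, dtype=float)
    # Turn reference counts into proportions, then scale them to the
    # size of the current dataset so both sides sum to the same total.
    expected = ref / ref.sum() * cur.sum()
    return chisquare(f_obs=cur, f_exp=expected)


# Made-up counts: the reference is 100x larger than the current dataset.
ref_counts = [5000, 3000, 2000]

# Same proportions as the reference: no evidence of drift.
stat_same, p_same = rescaled_chisquare(ref_counts, [50, 30, 20])

# Reversed proportions: strong evidence of drift.
stat_drift, p_drift = rescaled_chisquare(ref_counts, [20, 30, 50])

print(stat_same, p_same)
print(stat_drift, p_drift)
```

SciPy's `chisquare` expects the observed and expected frequencies to have the same sum, which is what the rescaling step guarantees.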

To add to that, this test does not work properly when the observed or expected frequencies in a category are too small (less than 5). In that case it is better to use something like Barnard's test, which we will be adding soon as well.
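A sketch of that small-count check, again assuming per-category counts and using SciPy's `barnard_exact`, which applies to 2x2 tables only (so two categories). The helper name `has_small_counts`, the threshold default, and the counts are hypothetical, and this is not how the library itself decides which test to run.

```python
import numpy as np
from scipy.stats import barnard_exact


def has_small_counts(ref_counts, cur_counts, threshold=5):
    """Return True if any observed or expected frequency is below threshold.

    Hypothetical helper for illustration only.
    """
    ref = np.asarray(ref_counts, dtype=float)
    cur = np.asarray(cur_counts, dtype=float)
    # Expected counts: reference proportions scaled to the current total.
    expected = ref / ref.sum() * cur.sum()
    return bool((cur < threshold).any() or (expected < threshold).any())


# Made-up binary feature: counts for category A and category B.
ref_counts = [40, 10]
cur_counts = [6, 4]

small = has_small_counts(ref_counts, cur_counts)
p_value = None
if small:
    # barnard_exact treats the columns as the two samples being compared,
    # so the columns are the datasets and the rows are the categories.
    table = np.array([ref_counts, cur_counts]).T
    p_value = barnard_exact(table).pvalue

print(small, p_value)
```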

emeli-dral avatar Oct 19 '21 19:10 emeli-dral

We have added several drift detection metrics (such as Wasserstein distance) that are less sensitive to dataset size.
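For a numerical feature, the Wasserstein distance can be computed directly on two samples of different sizes, with no rescaling of counts. The sketch below uses SciPy's `wasserstein_distance` on synthetic data; the sample sizes and distributions are assumptions for illustration, and the library may apply additional normalization or thresholds that are not shown here.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Synthetic data: a large reference sample and two small current samples.
reference = rng.normal(loc=0.0, scale=1.0, size=10_000)
current_same = rng.normal(loc=0.0, scale=1.0, size=200)
current_shifted = rng.normal(loc=1.0, scale=1.0, size=200)

# The distance compares the empirical distributions, so the samples
# do not need to have the same number of rows.
d_same = wasserstein_distance(reference, current_same)
d_shifted = wasserstein_distance(reference, current_shifted)

print(d_same, d_shifted)
```

The current sample drawn from the shifted distribution should yield a clearly larger distance than the one drawn from the same distribution as the reference.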

emeli-dral avatar Sep 21 '23 13:09 emeli-dral