evidently: Different sizes of reference dataset and current dataset

Open prarshah opened this issue 3 years ago • 1 comments

@emeli-dral When the reference dataset (usually the historical data) is much larger than the current dataset, should we rescale the frequencies so that we compare distributions of similar size?

ref: https://commons.apache.org/proper/commons-math/javadocs/api-3.3/org/apache/commons/math3/stat/inference/ChiSquareTest.html

Oct 19 '21 08:10 prarshah

Hi @prarshah ,

thank you for bringing this up!

Currently we do not account for the dataset sizes in this test; it makes sense to rescale the frequencies.
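
For illustration, a minimal sketch of what such rescaling could look like, assuming per-category counts are already available. The function name `rescaled_chisquare` and the counts are hypothetical, and this is not the current Evidently implementation; it only uses `scipy.stats.chisquare`, which requires observed and expected frequencies to have the same total.

```python
import numpy as np
from scipy.stats import chisquare


def rescaled_chisquare(reference_counts, current_counts):
    """Chi-square test of current counts against reference counts
    rescaled to the current sample size (hypothetical helper)."""
    ref = np.asarray(reference_counts, dtype=float)
    cur = np.asarray(current_counts, dtype=float)
    # Scale the reference frequencies so they sum to the current total.
    expected = ref * cur.sum() / ref.sum()
    statistic, p_value = chisquare(f_obs=cur, f_exp=expected)
    return expected, statistic, p_value


# Hypothetical counts: the reference set is 100x larger than the current set.
reference_counts = [5000, 3000, 2000]
current_counts = [45, 35, 20]
expected, statistic, p_value = rescaled_chisquare(reference_counts, current_counts)
```

An alternative that avoids explicit rescaling is a test of homogeneity on the 2 x k table of raw counts, for example with `scipy.stats.chi2_contingency`.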

In addition, this test does not work properly when the observed or expected frequencies in a category are too small (less than 5). For that case it is better to use something like Barnard’s test, which we are adding soon as well.
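
As a rough sketch of the small-count rule of thumb mentioned above, the check below flags categories where either the observed count or the rescaled expected count falls below 5. The helper name `small_count_categories` and the counts are hypothetical and not part of Evidently.

```python
def small_count_categories(reference_counts, current_counts, min_count=5):
    """Return indices of categories whose observed count, or expected count
    after rescaling the reference to the current size, is below min_count."""
    ref_total = sum(reference_counts)
    cur_total = sum(current_counts)
    flagged = []
    for i, (ref, obs) in enumerate(zip(reference_counts, current_counts)):
        expected = ref * cur_total / ref_total
        if obs < min_count or expected < min_count:
            flagged.append(i)
    return flagged


# Hypothetical counts with a rare third category.
reference_counts = [9000, 900, 100]
current_counts = [178, 19, 3]
flagged = small_count_categories(reference_counts, current_counts)
```

If any category is flagged, the chi-square approximation is questionable for this pair of datasets, and merging rare categories or switching to another test may be preferable.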

Oct 19 '21 19:10 emeli-dral

We have added several drift detection metrics (like Wasserstein distance), which are less sensitive to the dataset sizes.
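
A small sketch of why a distance between empirical distributions is less tied to sample size, using plain `scipy.stats.wasserstein_distance` on hypothetical numeric data. This is not necessarily how Evidently computes or normalizes its metric.

```python
from scipy.stats import wasserstein_distance

# Hypothetical samples with the same shape but very different sizes.
reference = [0.0, 1.0, 2.0, 3.0] * 250  # 1000 points
current = [0.0, 1.0, 2.0, 3.0] * 5      # 20 points

# Same empirical distribution: the distance does not grow with the size gap.
same_shape = wasserstein_distance(reference, current)

# Shifting the current sample by one unit shows up as a distance of about one.
shifted = wasserstein_distance(reference, [x + 1.0 for x in current])
```

Because the distance compares empirical distributions rather than raw counts, no rescaling step is needed before comparing the two samples.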

Sep 21 '23 13:09 emeli-dral
