Different sizes of reference dataset and current dataset
@emeli-dral When the reference dataset, which is usually the historical dataset, is much larger than the current dataset, should we rescale the frequencies so that we compare distributions of similar size?
ref: https://commons.apache.org/proper/commons-math/javadocs/api-3.3/org/apache/commons/math3/stat/inference/ChiSquareTest.html
Hi @prarshah,
thank you for bringing this up!
Currently we do not account for the dataset sizes in this test, so it makes sense to rescale the frequencies.
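To make the rescaling idea concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `rescaled_chi_square` and the choice to raise an error for categories missing from the reference data are assumptions made for illustration. The reference counts are multiplied by `len(current) / len(reference)` so that the expected and observed counts sum to the same total before the Pearson statistic is computed.

```python
from collections import Counter


def rescaled_chi_square(reference, current):
    """Return (statistic, degrees_of_freedom) for a Pearson chi-square
    comparison, with reference counts rescaled to the current sample size.

    Hypothetical helper for illustration only.
    """
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    categories = sorted(set(ref_counts) | set(cur_counts))

    # Scale factor that brings the reference total down (or up) to the current total.
    scale = len(current) / len(reference)

    statistic = 0.0
    for category in categories:
        expected = ref_counts[category] * scale
        observed = cur_counts[category]
        if expected == 0:
            # Assumption: a category unseen in the reference data is treated as an error,
            # because the statistic is undefined for a zero expected count.
            raise ValueError(f"category {category!r} is missing from the reference data")
        statistic += (observed - expected) ** 2 / expected

    return statistic, len(categories) - 1


# Reference is 10x larger than current; expected counts become 60 and 40.
reference = ["a"] * 600 + ["b"] * 400
current = ["a"] * 50 + ["b"] * 50
statistic, dof = rescaled_chi_square(reference, current)
```

Without the rescaling step, the expected counts (600 and 400) would be compared directly against observed counts of 50 and 50, and the statistic would mostly reflect the difference in sample size rather than a difference in distribution.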
In addition, this test does not work properly when the observed or expected frequencies in the categories are too small (less than 5 each). For that case it is better to use something like Barnard’s test, which we are adding soon as well.
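A quick way to see whether the small-frequency caveat applies is to list the categories whose observed or rescaled expected count falls below the threshold. The sketch below is only an illustration: `small_count_categories` and its `min_count` parameter are hypothetical names, and the default of 5 is the rule of thumb mentioned above.

```python
from collections import Counter


def small_count_categories(reference, current, min_count=5):
    """Return the categories whose observed count, or expected count
    (reference count rescaled to the current sample size), is below min_count.

    Hypothetical helper for illustration only.
    """
    ref_counts = Counter(reference)
    cur_counts = Counter(current)
    scale = len(current) / len(reference)

    flagged = []
    for category in sorted(set(ref_counts) | set(cur_counts)):
        expected = ref_counts[category] * scale
        observed = cur_counts[category]
        if observed < min_count or expected < min_count:
            flagged.append(category)
    return flagged


reference = ["a"] * 95 + ["b"] * 5
current = ["a"] * 18 + ["b"] * 2
flagged = small_count_categories(reference, current)
```

If the returned list is non-empty, the chi-square result for that feature should be read with caution, and a different test or metric may be a better fit.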
We have also added several drift detection metrics (such as Wasserstein distance) that are less sensitive to the dataset sizes.
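To illustrate why a distance like this is less sensitive to sample size, here is a minimal pure-Python sketch of the one-dimensional Wasserstein distance between two empirical samples. It is not the library's code, and `wasserstein_1d` is a hypothetical name; SciPy's `scipy.stats.wasserstein_distance` computes the same quantity for 1-D samples. The distance is the area between the two empirical CDFs, so it depends on proportions rather than raw counts.

```python
from bisect import bisect_right


def wasserstein_1d(u_values, v_values):
    """1-D Wasserstein (earth mover's) distance between two empirical samples,
    computed as the integral of |CDF_u(x) - CDF_v(x)| over x.

    Hypothetical helper for illustration only.
    """
    u_sorted = sorted(u_values)
    v_sorted = sorted(v_values)
    points = sorted(set(u_sorted) | set(v_sorted))

    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDFs are constant on [left, right), so evaluate them at `left`.
        cdf_u = bisect_right(u_sorted, left) / len(u_sorted)
        cdf_v = bisect_right(v_sorted, left) / len(v_sorted)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


# Same proportions, different sample sizes: the distance stays at zero.
same_shape = wasserstein_1d([0, 1] * 500, [0, 1] * 5)

# Shifted samples: the distance reflects the shift.
shifted = wasserstein_1d([0, 1, 3], [5, 6, 8])
```

In the first call the reference sample is 100 times larger than the current one, yet the distance is zero because both samples have the same empirical distribution.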