The fixed value for feel_zeroes in get_binned_data may lead to deviation in some case.
In the implemntation of get_binned_data https://github.com/evidentlyai/evidently/blob/4e4637d361d11354d26f361a07e1c5a847bbfc74/src/evidently/calculations/stattests/utils.py#L33
By default, 0.0001 is used to fill the value with a percent of 0. It is valid when the minimum percent of non-zero is greater than 0.0001. But when the minimum percent in the data is less than or equal to 0.0001, like 0.00001. After we filling zeros, the original zero percent is now 0.0001, which is larger than the existing minimum percent(0.00001), and the data distribution is wrong after filling zeros.
The default value of fill zeros in this place, I think, should be set dynamically according to the percentage of data, rather than being fixed directly. How to set dynamically needs further discussion. For example, it should be set to one tenth of the minimum percent, or other values, but it must be guaranteed to be smaller than the minimum percent, otherwise it will be wrong.
I found this problem because when I tried to calculate Kullback Leibler divergence drift score on my data, I found that the minimum percent of my data was 0.00001, which was smaller than the default fill zeros, which resulted in incorrect Kullback Leibler divergence drift score. For example, the default calculated Kullback Leibler divergence drift score in my data is 0.41. After I change fill zeros to one tenth of the minimum percent, the calculated Kullback Leibler divergence drive score is 1.9.
Hi @chi2liu,
Thanks for reporting it! We will not be able to address it immediately as the ongoing API update is currently a priority, but I added it to the list of feature requests we regularly review. Will get there!
We also welcome a contribution if you want to fix it!
Wow, I managed to mess up a bit with this issue by adding many different references in my local commits! Sorry!! (new to contributing here)
I added opened a pull request for this with a temporal change.
I added an if statement:
np.place(reference_percents, reference_percents == 0, min(reference_percents[reference_percents!=0])/10**6 if min(reference_percents[reference_percents!=0]) <= 0.0001 else 0.0001)
np.place(current_percents, current_percents == 0, min(current_percents[current_percents!=0])/10**6 if min(current_percents[current_percents!=0]) <= 0.0001 else 0.0001)
So, I basically did what you mentioned @chi2liu but instead of one-tenth I added 10^6. This is just to make it as close to zero (the actual value).
For your case (and other similar) this will increase the Kullback Leibler divergence drift score.
Wow, I managed to mess up a bit with this issue by adding many different references in my local commits! Sorry!! (new to contributing here)
I added opened a pull request for this with a temporal change.
I added an if statement:
np.place(reference_percents, reference_percents == 0, min(reference_percents[reference_percents!=0])/10**6 if min(reference_percents[reference_percents!=0]) <= 0.0001 else 0.0001) np.place(current_percents, current_percents == 0, min(current_percents[current_percents!=0])/10**6 if min(current_percents[current_percents!=0]) <= 0.0001 else 0.0001)So, I basically did what you mentioned @chi2liu but instead of one-tenth I added 10^6. This is just to make it as close to zero (the actual value).
For your case (and other similar) this will increase the Kullback Leibler divergence drift score.
@AntonisCSt I don't think it's meaningful to use 0.0001 as threshold. Maybe it should be taken as 10^6 of the minimum value for all. Because we want to make it as close to zero (the actual value) as you mentioned. Why is the exception for greater than 0.0001? I don't know why it was set to 0.0001 by default at present, perhaps because of special considerations. @emeli-dral
@chi2liu yes good point. That definitely needs a discussion. I kept it so it could contain the initial logic of the creator. But I agree.
I will put an additional argument for my fix of why 10^6 and not 10^9. (hehe)
Hi @emeli-dral hope you are well.
I have submitted a pull request for this issue.
It should be sorted and I also suggested to update the parameter name from feel_zeroes to fill_zeroes for clarity.
thanks!
Definitely makes more sense now ^^
hi i would like to contribute
Hi, I’m new to open source and would like to work on this. May I take it?
Hi @emeli-dral!
I’ve created a new PR #1660 that addresses this issue, based on the original work done in PR #1278. The code has since moved to the legacy path, but I’ve rebased and updated it to reflect the current structure.
Hope this resolves the issue, happy to contribute more improvements going forward!