evidently Fix zero percentage replacement in get_binned_data Function and update parameter name

This pull request introduces changes that address the issue: The fixed value for feel_zeroes in get_binned_data may lead to deviation in some case. #334

Dynamic Fill Value Calculation: The fill value used to replace zero percentages is now calculated dynamically based on the actual data, rather than using a fixed value.
Ensuring Correct Fill Value: The fill value is guaranteed to be smaller than the minimum non-zero percentage in both the reference and current datasets. This adjustment ensures that the data distribution remains accurate.
Maintaining Data Distribution: These changes help maintain the correct distribution of data, which is crucial for accurate statistical tests, including Kullback-Leibler divergence drift score calculations.
Parameter Name Update: The parameter name has been updated from "feel_zeroes" to "fill_zeroes" for clarity and consistency.

Hope it helps boemer00

Sep 02 '24 20:09 boemer00

Enhance `get_binned_data` Function with Flexible Fill Methods and Dynamic Scaling

This pull request introduces enhancements to my previous PR regarding the get_binned_data function. These changes address issues with handling very small percentages in data drift calculations while providing some flexibility and maintaining its original purpose.

Changes Made:

New Parameters:
- fill_method (str, default='auto'): Allows users to select the method for calculating the fill value for zero percentages.
- dynamic_scale (bool, default=False): Enables dynamic scaling of the fill value based on the data distribution.
Fill Methods: Three fill methods are now supported:
- auto: Calculates a fill value that is the minimum of 1/10 and 1/2 of the smallest non-zero percentage.
- min: Uses the minimum non-zero percentage as the fill value.
- mean: Uses the average of the minimum non-zero percentages from reference and current data.

Dynamic Scaling: When enabled, applies an additional scaling factor to the fill value:

scale_factor = min(min_non_zero_ref, min_non_zero_cur) / max(min_non_zero_ref, min_non_zero_cur)

This is beneficial in cases where there is a significant difference between the minimum non-zero percentages in the reference and current data.

Normalisation: After applying the fill value, the percentages are normalised to ensure they always sum to 1:

reference_percents = reference_percents / np.sum(reference_percents) current_percents = current_percents / np.sum(current_percents)

Error Handling: Added a ValueError for invalid fill_method values to prevent unexpected behaviour.

Lastly, I am happy to write a test for the new parameters but would appreciate direction on where this should be implemented.

Thanks

Sep 05 '24 15:09 boemer00

The linter check has been failing for a while, and now the branch also has merge conflicts with main. Given the current state of the repo, resolving both issues would likely be more work than starting fresh.

If the changes are still relevant, feel free to rebase onto the latest main and open a new PR. Happy to review it then!

Jun 25 '25 17:06 emeli-dral

Hey @boemer00, thank you for the original work here. Since this branch had diverged and had some linter and merge issues, I’ve opened a fresh PR based on your implementation and rebased it on the latest main.

New PR: #1660

Hello, @emeli-dral! Happy to continue iteration there!

Jul 14 '25 23:07 bharath03-a

Fix zero percentage replacement in get_binned_data Function and update parameter name

Enhance get_binned_data Function with Flexible Fill Methods and Dynamic Scaling

Changes Made:

Enhance `get_binned_data` Function with Flexible Fill Methods and Dynamic Scaling