evidently icon indicating copy to clipboard operation
evidently copied to clipboard

Fix zero percentage replacement in get_binned_data Function and update parameter name

Open boemer00 opened this issue 1 year ago • 1 comments

This pull request introduces changes that address the issue: The fixed value for feel_zeroes in get_binned_data may lead to deviation in some case. #334

  • Dynamic Fill Value Calculation: The fill value used to replace zero percentages is now calculated dynamically based on the actual data, rather than using a fixed value.

  • Ensuring Correct Fill Value: The fill value is guaranteed to be smaller than the minimum non-zero percentage in both the reference and current datasets. This adjustment ensures that the data distribution remains accurate.

  • Maintaining Data Distribution: These changes help maintain the correct distribution of data, which is crucial for accurate statistical tests, including Kullback-Leibler divergence drift score calculations.

  • Parameter Name Update: The parameter name has been updated from "feel_zeroes" to "fill_zeroes" for clarity and consistency.

Hope it helps boemer00

boemer00 avatar Sep 02 '24 20:09 boemer00

Enhance get_binned_data Function with Flexible Fill Methods and Dynamic Scaling

This pull request introduces enhancements to my previous PR regarding the get_binned_data function. These changes address issues with handling very small percentages in data drift calculations while providing some flexibility and maintaining its original purpose.

Changes Made:

  1. New Parameters:

    • fill_method (str, default='auto'): Allows users to select the method for calculating the fill value for zero percentages.
    • dynamic_scale (bool, default=False): Enables dynamic scaling of the fill value based on the data distribution.
  2. Fill Methods: Three fill methods are now supported:

    • auto: Calculates a fill value that is the minimum of 1/10 and 1/2 of the smallest non-zero percentage.
    • min: Uses the minimum non-zero percentage as the fill value.
    • mean: Uses the average of the minimum non-zero percentages from reference and current data.
  3. Dynamic Scaling: When enabled, applies an additional scaling factor to the fill value:

    scale_factor = min(min_non_zero_ref, min_non_zero_cur) / max(min_non_zero_ref, min_non_zero_cur)
    
    

This is beneficial in cases where there is a significant difference between the minimum non-zero percentages in the reference and current data.

  1. Normalisation: After applying the fill value, the percentages are normalised to ensure they always sum to 1:

reference_percents = reference_percents / np.sum(reference_percents) current_percents = current_percents / np.sum(current_percents)

  1. Error Handling: Added a ValueError for invalid fill_method values to prevent unexpected behaviour.

Lastly, I am happy to write a test for the new parameters but would appreciate direction on where this should be implemented.

Thanks

boemer00 avatar Sep 05 '24 15:09 boemer00

The linter check has been failing for a while, and now the branch also has merge conflicts with main. Given the current state of the repo, resolving both issues would likely be more work than starting fresh.

If the changes are still relevant, feel free to rebase onto the latest main and open a new PR. Happy to review it then!

emeli-dral avatar Jun 25 '25 17:06 emeli-dral

Hey @boemer00, thank you for the original work here. Since this branch had diverged and had some linter and merge issues, I’ve opened a fresh PR based on your implementation and rebased it on the latest main.

New PR: #1660

Hello, @emeli-dral! Happy to continue iteration there!

bharath03-a avatar Jul 14 '25 23:07 bharath03-a