giskard
[GSK-1279] More rigorous evaluation of significance of performance metrics
Following feedback from user KD_A on Reddit, who recommends sounder handling of statistical significance to prevent selection bias, in particular using the Benjamini-Hochberg procedure to control the false discovery rate.
The problem is that we currently test several data slice candidates and metrics without accounting for selection bias → this can lead to a high number of false-positive detections.
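For reference, the Benjamini-Hochberg procedure sorts the candidate p-values, compares each to a linearly growing threshold, and keeps every detection up to the largest one that passes. A minimal, self-contained sketch of the procedure (the alpha level and p-values below are illustrative, not taken from this issue):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of detections kept after BH FDR control."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                            # sort p-values ascending
    thresholds = alpha * (np.arange(1, m + 1) / m)   # BH step-up thresholds
    below = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])             # largest index passing the test
        keep[order[: k + 1]] = True                  # keep all detections with smaller p-values
    return keep

# Example: p-values from several slice/metric candidates
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p_values, alpha=0.05))
```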
To do
- [X] Add simple statistical tests to the current implementation and measure significance on the test models already available as pytest fixtures → do we have detections with high p-values?
- [X] If we do, check if we can set an FPR parameter on `PerformanceBiasDetector` and filter the detections based on their p-values with the Benjamini-Hochberg procedure (see the sketch after this list).
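A hedged sketch of what the filtering could look like end to end: compute one p-value per slice/metric candidate (here with a one-sided Fisher exact test comparing accuracy inside the slice against the rest of the dataset, just one possible choice of test), then keep only the candidates that survive Benjamini-Hochberg at a configurable level. `SliceCandidate`, `significant_detections`, and `fdr_level` are hypothetical names for illustration, not Giskard's actual `PerformanceBiasDetector` API:

```python
from dataclasses import dataclass
from typing import List

from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests


@dataclass
class SliceCandidate:
    """Hypothetical record for one slice/metric candidate (not Giskard's API)."""
    name: str
    slice_correct: int   # correct predictions inside the slice
    slice_total: int     # rows inside the slice
    rest_correct: int    # correct predictions outside the slice
    rest_total: int      # rows outside the slice


def significant_detections(candidates: List[SliceCandidate], fdr_level: float = 0.05):
    """Keep only candidates whose accuracy drop survives BH FDR control."""
    p_values = []
    for c in candidates:
        table = [
            [c.slice_correct, c.slice_total - c.slice_correct],
            [c.rest_correct, c.rest_total - c.rest_correct],
        ]
        _, p = fisher_exact(table, alternative="less")  # slice accuracy < rest accuracy
        p_values.append(p)

    # Benjamini-Hochberg across all candidates at the requested FDR level
    reject, _, _, _ = multipletests(p_values, alpha=fdr_level, method="fdr_bh")
    return [c for c, keep in zip(candidates, reject) if keep]


candidates = [
    SliceCandidate("country == 'FR'", slice_correct=40, slice_total=80,
                   rest_correct=850, rest_total=920),
    SliceCandidate("age < 25", slice_correct=88, slice_total=100,
                   rest_correct=810, rest_total=900),
]
for detection in significant_detections(candidates, fdr_level=0.05):
    print(detection.name)
```

Exposing something like `fdr_level` on `PerformanceBiasDetector` would let users trade detection sensitivity against the expected share of false-positive slices.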