giskard
[GSK-1279] More rigorous evaluation of significance of performance metrics
Following feedback from user KD_A on Reddit, who recommends sounder handling of statistical significance to prevent selection bias, in particular using the Benjamini-Hochberg procedure to control the false discovery rate.
The problem is that we currently test several data slice candidates and metrics without accounting for selection bias → this can lead to a high number of false-positive detections.
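For reference, the Benjamini-Hochberg procedure sorts the candidate p-values, compares each to a linearly growing threshold, and keeps every detection up to the largest one that passes. A minimal, self-contained sketch of the procedure (the alpha level and p-values below are illustrative, not taken from this issue):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of detections kept after BH FDR control."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                            # sort p-values ascending
    thresholds = alpha * (np.arange(1, m + 1) / m)   # BH step-up thresholds
    below = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])             # largest index passing the test
        keep[order[: k + 1]] = True                  # keep all detections with smaller p-values
    return keep

# Example: p-values from several slice/metric candidates
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(benjamini_hochberg(p_values, alpha=0.05))
```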
To do
- [X] Add simple statistical tests to the current implementation and measure significance on the test models already available as pytest fixtures → do we have detections with high p-values?
- [X] If we do, check if we can set an FPR parameter on `PerformanceBiasDetector` and filter the detections based on their p-values with the Benjamini-Hochberg procedure (see the sketch after this list).
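A hedged sketch of what the filtering could look like end to end: compute one p-value per slice/metric candidate (here with a one-sided Fisher exact test comparing accuracy inside the slice against the rest of the dataset, just one possible choice of test), then keep only the candidates that survive Benjamini-Hochberg at a configurable level. `SliceCandidate`, `significant_detections`, and `fdr_level` are hypothetical names for illustration, not Giskard's actual `PerformanceBiasDetector` API:

```python
from dataclasses import dataclass
from typing import List

from scipy.stats import fisher_exact
from statsmodels.stats.multitest import multipletests


@dataclass
class SliceCandidate:
    """Hypothetical record for one slice/metric candidate (not Giskard's API)."""
    name: str
    slice_correct: int   # correct predictions inside the slice
    slice_total: int     # rows inside the slice
    rest_correct: int    # correct predictions outside the slice
    rest_total: int      # rows outside the slice


def significant_detections(candidates: List[SliceCandidate], fdr_level: float = 0.05):
    """Keep only candidates whose accuracy drop survives BH FDR control."""
    p_values = []
    for c in candidates:
        table = [
            [c.slice_correct, c.slice_total - c.slice_correct],
            [c.rest_correct, c.rest_total - c.rest_correct],
        ]
        _, p = fisher_exact(table, alternative="less")  # slice accuracy < rest accuracy
        p_values.append(p)

    # Benjamini-Hochberg across all candidates at the requested FDR level
    reject, _, _, _ = multipletests(p_values, alpha=fdr_level, method="fdr_bh")
    return [c for c, keep in zip(candidates, reject) if keep]


candidates = [
    SliceCandidate("country == 'FR'", slice_correct=40, slice_total=80,
                   rest_correct=850, rest_total=920),
    SliceCandidate("age < 25", slice_correct=88, slice_total=100,
                   rest_correct=810, rest_total=900),
]
for detection in significant_detections(candidates, fdr_level=0.05):
    print(detection.name)
```

Exposing something like `fdr_level` on `PerformanceBiasDetector` would let users trade detection sensitivity against the expected share of false-positive slices.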