
test_ksdrift is flaky when all seed-setting code is removed


Introduction

The test test_ksdrift in alibi_detect/cd/tests/test_ks.py seems to be flaky when all seed-setting code (e.g. np.random.seed(0) or tf.random.set_seed(0)) is commented out.

For instance, at commit 1b06ecd37a08280d3bcff2b41b123a1f528afc0d (version 0.5.2), test_ksdrift[368] through test_ksdrift[375] fail roughly 4-12% of the time (over 500 runs) when all seed-setting code is removed, compared to 0% of the time (over 500 runs) when the seed-setting code is left in place.
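For reference, one way to estimate such a failure rate is to re-run the selected test many times and count failing runs. Below is a minimal sketch of such a loop; it assumes pytest is installed, the command is run from the repository root, and the test ID test_ksdrift[368] reflects the parametrization order at the commit above.

```python
# Re-run one of the parametrized tests repeatedly (with the seed-setting code
# commented out) and count failures; a non-zero pytest exit code counts as a
# failure. Illustrative sketch only, not part of the test suite.
import subprocess
import sys

runs, failures = 500, 0
for _ in range(runs):
    rc = subprocess.run(
        [sys.executable, "-m", "pytest", "-q",
         "alibi_detect/cd/tests/test_ks.py::test_ksdrift[368]"],
        stdout=subprocess.DEVNULL,
    ).returncode
    failures += int(rc != 0)
print(f"failure rate over {runs} runs: {failures / runs:.1%}")
```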

Tests 368-375 exercise the "less" alternative hypothesis of the KS drift detector with a UAE preprocessing step under:

- Bonferroni and FDR correction: correction = ['bonferroni', 'fdr'];
- the latest-window and reservoir-sampling reference update strategies: update_X_ref = [{'last': 1000}, {'reservoir_sampling': 1000}];
- whether the reference data are preprocessed up front: preprocess_X_ref = [True, False].
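For concreteness, below is a minimal sketch (not the actual test code) of a detector configured like one of these parametrizations. It assumes the 0.5.x API, in which KSDrift accepts p_val, X_ref, preprocess_X_ref, update_X_ref, preprocess_fn, correction and alternative keyword arguments; the random data and the random-projection stand-in for the UAE are illustrative only, and the exact signature and return keys may differ in other versions.

```python
import numpy as np
from alibi_detect.cd import KSDrift

n, n_features, enc_dim = 500, 10, 2
rng = np.random.RandomState(0)
X_ref = rng.randn(n, n_features).astype(np.float32)

# Stand-in for the untrained autoencoder (UAE) used in the tests:
# a fixed random projection to a lower-dimensional space.
proj = rng.randn(n_features, enc_dim).astype(np.float32)
def preprocess_fn(x: np.ndarray) -> np.ndarray:
    return x @ proj

cd = KSDrift(
    p_val=.05,
    X_ref=X_ref,
    preprocess_X_ref=True,                      # also parametrized as False
    update_X_ref={'reservoir_sampling': 1000},  # or {'last': 1000}
    preprocess_fn=preprocess_fn,
    correction='bonferroni',                    # or 'fdr'
    alternative='less',
)
preds = cd.predict(rng.randn(n, n_features).astype(np.float32))
print(preds['data']['is_drift'])
```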

Motivation

Some tests can have high failure rates, but their flakiness goes unnoticed as long as the seeds are set, as is the case for the test above. We are trying to stabilize such tests.

Environment

The tests were run using pytest 6.2.2 in a conda environment with Python 3.6.13. The OS used was Ubuntu 16.04.

Possible Solutions

One possible way to reduce the flakiness is to change the parameters the tests use for prediction. We tried the following changes (a sketch of the corresponding parametrization tweak follows the list).

- Increasing n_infer from 2 to 10 does not seem to reduce the failure rate.
- Increasing the update_X_ref window/reservoir size from 1000 to 3000 seems to reduce the failure rate to roughly 2-5%.
- Increasing it further to 7500 also yields a 2-5% failure rate, though the failures are distributed differently than with 3000.
- Changing update_X_ref does not noticeably affect the test runtimes.
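As a concrete illustration of the changes above, the sketch below shows the kind of parametrization tweak we experimented with; it is hypothetical and only mirrors the parameter names already discussed, with the values 3000/7500 being the ones we tried.

```python
# Hypothetical tweak to the parametrization in alibi_detect/cd/tests/test_ks.py:
# enlarge the reference-update window / reservoir from 1000 to 3000 (or 7500),
# which reduced the observed failure rate from ~4-12% to ~2-5% in our runs.
update_X_ref = [{'last': 3000}, {'reservoir_sampling': 3000}]

# Raising n_infer from 2 to 10 did not noticeably reduce the failure rate.
n_infer = 10
```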

Please let me know if this solution is feasible or if there are other solutions that should be incorporated. If you are interested, we can share details of other tests that show similar behavior. We would be happy to raise a pull request to fix the tests and to incorporate any feedback you may have.

melonwater211 · May 23 '21 21:05