[BUG] `test_fit_idempotent()` failing to produce 'equal' results to the required degree of accuracy for DL estimators
Describe the bug
First found during #3425, this bug causes failures because the idempotence property of estimators appears to be violated, resulting in a batch of AssertionErrors only for DL estimators. I conducted a diagnostic run via #3611 and found that the degree of divergence varies widely between estimators.
For example, in the case of `CNNClassifier`'s `predict_proba`, the values are only marginally apart.
On the opposite end, we have something like `CNNRegressor`'s `predict`, where the values diverge substantially.
Initially, this test was skipped for all DL estimators, which is probably why this went undiagnosed for so long.
Edit: This error is also sporadic in nature, i.e., it does not always fail. This can be seen in the new failures that appear after skipping all the problematic estimators here. This also means that all estimators are affected (which implies the source is probably not in the individual estimators).
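For context, the failing test checks (roughly) the following property. This is a minimal illustrative sketch, not sktime's actual `test_fit_idempotent` implementation; `ToyEstimator` is a hypothetical seeded stand-in for a DL estimator:

```python
import numpy as np


class ToyEstimator:
    """Hypothetical stand-in for an sktime estimator with a random_state parameter."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y):
        # "Training" draws random weights; with a fixed seed this is reproducible.
        rng = np.random.default_rng(self.random_state)
        self.coef_ = rng.standard_normal(X.shape[1])
        return self

    def predict(self, X):
        return X @ self.coef_


X = np.arange(20, dtype=float).reshape(5, 4)
y = np.zeros(5)

est = ToyEstimator(random_state=42)
first = est.fit(X, y).predict(X)
second = est.fit(X, y).predict(X)  # refit on the same data

# The idempotence property under test: refitting must reproduce predictions.
np.testing.assert_allclose(first, second)
```

The DL estimators fail exactly this last comparison: the second `fit` on identical data yields different predictions.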
As discussed during DL/Benchmarking meeting on 21/10/22
I think we need to re-evaluate whether we need to fix this bug right now. The bug has existed since the first DL classifier was imported, and the test had to be ignored for the reason specified by @achieveordie. One attempted solution was to set a fixed random_state, but for some reason that did not pass all the tests consistently.
The reason I am saying it might not be necessary to solve this is:
- As discussed with @TonyBagnall some months ago, when we incorporate GPU support for DL estimators (which will be required), there will be randomness that causes slightly different values. GPU support work was started by me in #3072 but was abandoned since my focus was more on other estimator migration. Hence there will always be some randomness in the results, leading to some divergence.
- If the idea is just to fine-tune the number of significant digits allowed to differ, even then I feel it would not generalise easily. As @achieveordie mentions, in some cases the divergence is very small (to satisfy the idempotence test, if I recall correctly, results had to agree up to 1e-6), but in other cases it is larger. It would be difficult to find a general limit that satisfies all the estimators; even if we could find a threshold, we would also need it to be general enough that any future estimators are held to the same standard (which may or may not be possible, since the present estimators are faithful to the original implementations by the authors, so there is not much scope for change).
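To illustrate why a single tolerance is hard to pick, here is a small sketch with hypothetical numbers (not measured from the actual estimators):

```python
import numpy as np

baseline = np.array([0.2, 0.5, 0.3])

# Small divergence (like the predict_proba cases): passes at atol=1e-6.
close = baseline + 1e-8
np.testing.assert_allclose(close, baseline, atol=1e-6)

# Large divergence (like the regressor predict cases): fails at atol=1e-6,
# and any atol loose enough to accept it would make the test meaningless.
far = baseline + 0.05
try:
    np.testing.assert_allclose(far, baseline, atol=1e-6)
except AssertionError:
    print("divergence exceeds tolerance")
```

A threshold that passes both cases would have to be around 0.05 here, at which point the test no longer detects anything useful.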
I agree with this. I think you will just run into the inherent instability of TensorFlow, and the utility of these tests imo vanishes if you have to engineer them so much to pass. Initially I used an almost-equal test, but it is not robust.
I think we need to be careful and precise about what we are even saying. The subtlety lies in what the interface contract is.
E.g., do we agree on the following statement to be part of the (implied) contract:
if an estimator has a random_state parameter, then passing a non-None integer value renders the estimator's behaviour deterministic.
If we agree to this contract, then the test is correct and the estimators are genuinely failing, which would give it the status of bugs.
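The contract can be phrased as a concrete check. A minimal sketch, using numpy's RNG as an illustrative stand-in for an estimator's internal randomness (the `draw_weights` helper is hypothetical):

```python
import numpy as np


def draw_weights(random_state=None):
    """Hypothetical stand-in for fit: draws 'weights' from the estimator's RNG."""
    rng = np.random.default_rng(random_state)
    return rng.standard_normal(5)


# The contract: a non-None integer random_state makes behaviour deterministic,
# so two runs with the same seed must agree bit-for-bit.
a = draw_weights(random_state=7)
b = draw_weights(random_state=7)
assert np.array_equal(a, b)

# With random_state=None, no determinism is promised by the contract.
```

Under this reading, a DL estimator that produces different results across two seeded fits is violating the contract, regardless of how small the numerical difference is.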
Because I do think this contract is what we are assuming and implying, I do think that these are bugs and the estimators are genuinely failing. Arguing that it is all fine without reference to the contract seems a bit of fuzzy thinking around contracts and tests (without an explicit reference to the contract, I don't think the arguments of @achieveordie or @TonyBagnall are clear).
@fkiraly You're absolutely correct that we need to have this conversation with respect to the expected contract. What I meant when I said "the problems do not lie within estimators" was that the individual estimators might not be at fault and are instead failing due to underlying abstract base class.
I probably could have worded it better. But at this point, I think all we are doing is making an informed guess as to where the problem lies since we haven't pinpointed the source of this bug.
In case it helps anyone, I tried to get InceptionTime to be deterministic. It seems like fewer tests are failing, but there are still some failures, e.g., of `pytest sktime/tests/test_all_estimators.py::TestAllEstimators::test_fit_idempotent[InceptionTimeClassifier-1-ClassifierFitPredict-predict_proba]`.
Changes relative to ffe5b1e4a61f8fb2f34fed0bfa15f854e54111f1:
```diff
diff --git a/sktime/classification/deep_learning/inceptiontime.py b/sktime/classification/deep_learning/inceptiontime.py
index 897e2acc..e3cd6e4e 100644
--- a/sktime/classification/deep_learning/inceptiontime.py
+++ b/sktime/classification/deep_learning/inceptiontime.py
@@ -2,6 +2,13 @@
 __author__ = "james-large"
 __all__ = ["InceptionTimeClassifier"]
+import tensorflow as tf
+
+tf.keras.utils.set_random_seed(1)
+tf.config.experimental.enable_op_determinism()
+tf.config.threading.set_inter_op_parallelism_threads(1)
+tf.config.threading.set_intra_op_parallelism_threads(1)
+
 from copy import deepcopy
 from sklearn.utils import check_random_state
diff --git a/sktime/networks/inceptiontime.py b/sktime/networks/inceptiontime.py
index f3dc0e9a..5b7aeafe 100644
--- a/sktime/networks/inceptiontime.py
+++ b/sktime/networks/inceptiontime.py
@@ -1,10 +1,16 @@
 """Inception Time."""
 __author__ = ["JamesLarge", "Withington"]
+import tensorflow as tf
+
+tf.keras.utils.set_random_seed(1)
+tf.config.experimental.enable_op_determinism()
+tf.config.threading.set_inter_op_parallelism_threads(1)
+tf.config.threading.set_intra_op_parallelism_threads(1)
+
 from sktime.networks.base import BaseDeepNetwork
 from sktime.utils.validation._dependencies import _check_dl_dependencies
-
 class InceptionTimeNetwork(BaseDeepNetwork):
     """InceptionTime adapted from the implementation from Fawaz et al.
diff --git a/sktime/tests/_config.py b/sktime/tests/_config.py
index c2160a9c..ae66a852 100644
--- a/sktime/tests/_config.py
+++ b/sktime/tests/_config.py
@@ -36,7 +36,7 @@ EXCLUDE_ESTIMATORS = [
     "EditDist",
     "CNNClassifier",
     "FCNClassifier",
-    "InceptionTimeClassifier",
+    #"InceptionTimeClassifier",
     "LSTMFCNClassifier",
     "MLPClassifier",
     "MLPRegressor",
@@ -149,7 +149,7 @@ EXCLUDED_TESTS = {
         "test_save_estimators_to_file",
     ],
     "InceptionTimeClassifier": [
-        "test_fit_idempotent",
+        # "test_fit_idempotent",
     ],
     "SimpleRNNClassifier": [
         "test_fit_idempotent",
@@ -13,6 +13,14 @@ from copy import deepcopy
 from inspect import getfullargspec, isclass, signature
 from tempfile import TemporaryDirectory
+import tensorflow as tf
+
+tf.keras.utils.set_random_seed(1)
+tf.config.experimental.enable_op_determinism()
+tf.config.threading.set_inter_op_parallelism_threads(1)
+tf.config.threading.set_intra_op_parallelism_threads(1)
+
+
 import joblib
 import numpy as np
 import pandas as pd
```
Everything I've read about making tensorflow deterministic seems to indicate that these changes should be sufficient, but they are clearly not. The second call to fit differs from the first call, and the result of the first fit call changes for every pytest invocation. I'm running this test on "AMD Ryzen 7 5700G with Radeon Graphics".
Interesting - but if this reduces the degree of randomness, that would already be a big win!
I suppose the place to set the config would be in BaseDeepClassifier, or BaseDeepNetwork?
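One possible shape for that, sketched below. The `tensorflow` calls mirror the diff above; everything else (the `_configure_determinism` helper and this slimmed-down base class) is hypothetical, not sktime's actual `BaseDeepClassifier`:

```python
def _configure_determinism(seed=1):
    """Hypothetical helper applying the determinism settings from the diff above.

    Returns True if tensorflow was configured, False otherwise.
    """
    try:
        import tensorflow as tf

        tf.keras.utils.set_random_seed(seed)
        tf.config.experimental.enable_op_determinism()
        tf.config.threading.set_inter_op_parallelism_threads(1)
        tf.config.threading.set_intra_op_parallelism_threads(1)
    except ImportError:
        # tensorflow not installed
        return False
    except RuntimeError:
        # the threading config raises RuntimeError if the TF context
        # has already been initialised in this process
        return False
    return True


class BaseDeepClassifier:
    """Illustrative stand-in for sktime's BaseDeepClassifier."""

    def __init__(self, random_state=None):
        self.random_state = random_state
        if random_state is not None:
            # Seeding in the base class means every DL subclass inherits the
            # behaviour, instead of repeating module-level calls in each file.
            self.configured_ = _configure_determinism(random_state)
```

Doing it in `__init__` rather than at module import also ties the seeding to the estimator's own `random_state`, which is closer to the contract discussed above; one caveat is that the settings are process-global, so concurrently used estimators with different seeds would interfere.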