Fix normalization bug in GaussianProcessRegressor
Fixes #29697
Bug Fix:
- Normalization: Corrected the normalization logic in the `fit` method to ensure consistent handling of the `_y_train_mean` and `_y_train_std` variables, regardless of whether `normalize_y` is `True` or `False`.
- Modified the prediction logic to apply the inverse transformation only if `normalize_y` is `True`, ensuring predictions are correctly scaled.
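The fit/predict symmetry described above can be sketched in isolation. This is a minimal illustration of the intended contract, not scikit-learn's actual implementation; `TinyNormalizer` is a hypothetical helper introduced only to show that the inverse transform must be applied exactly when (and only when) normalization was applied in `fit`.

```python
import numpy as np


class TinyNormalizer:
    """Sketch of the normalize_y contract (hypothetical, not sklearn code):
    subtract the mean and divide by the std in fit, and undo that
    transformation in predict only when normalize_y is True."""

    def __init__(self, normalize_y=False):
        self.normalize_y = normalize_y

    def fit(self, y):
        y = np.asarray(y, dtype=float)
        if self.normalize_y:
            self._y_train_mean = y.mean(axis=0)
            self._y_train_std = y.std(axis=0)
            y = (y - self._y_train_mean) / self._y_train_std
        else:
            # Identity transform: mean 0, std 1, so no rescaling occurs
            n_outputs = y.shape[1] if y.ndim > 1 else 1
            self._y_train_mean = np.zeros(n_outputs)
            self._y_train_std = np.ones(n_outputs)
        self._y_train_ = y
        return self

    def predict(self, y_mean):
        # Undo the normalization only if it was applied during fit
        if self.normalize_y:
            y_mean = self._y_train_std * y_mean + self._y_train_mean
        return y_mean
```

With `normalize_y=True` the round trip recovers the original targets; with `normalize_y=False` predictions pass through unchanged, which is the behavior the fix enforces.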
Test Suite Correction:
- Updated `test_y_multioutput` to correctly handle cases where the second output's standard deviation should scale differently when `normalize_y=False`. This change aligns with the expected behavior after the bug fix.
- Added assertions for covariance values, which were calculated but not previously verified.
- Introduced a new test for `normalize_y=True` to enhance the robustness of the test suite.
- Incorporated a test inspired by the issue report, ensuring comprehensive coverage.
❌ Linting issues
This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.
You can see the details of the linting issues under the lint job here
black
black detected issues. Please run `black .` locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is `black=24.3.0`.
--- /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_compare_gpr_krr.py 2024-09-27 18:12:20.854176+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_compare_gpr_krr.py 2024-09-27 18:12:29.269175+00:00
@@ -392,6 +392,6 @@
# The effect of using a radial basis function kernel will attenuate the
# periodicity effect once that no sample are available in the training.
# As testing samples get further away from the training ones, predictions
# are converging towards their mean and their standard deviation
# also increases.
-plt.show()
\ No newline at end of file
+plt.show()
would reformat /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_compare_gpr_krr.py
--- /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_gpr_prior_posterior.py 2024-09-27 18:12:20.854176+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_gpr_prior_posterior.py 2024-09-27 18:12:29.355953+00:00
@@ -255,6 +255,5 @@
f"Kernel parameters after fit: \n{gpr.kernel_} \n"
f"Log-likelihood: {gpr.log_marginal_likelihood(gpr.kernel_.theta):.3f}"
)
plt.show()
-
would reformat /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_gpr_prior_posterior.py
--- /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/_gpr.py 2024-09-27 18:12:20.892177+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/_gpr.py 2024-09-27 18:12:35.610865+00:00
@@ -272,11 +272,10 @@
# Normalize target value only if normalize_y is True
if self.normalize_y:
y = (y - self._y_train_mean) / self._y_train_std
else:
self._y_train_mean = np.zeros_like(self._y_train_mean)
-
if np.iterable(self.alpha) and self.alpha.shape[0] != y.shape[0]:
if self.alpha.shape[0] == 1:
self.alpha = self.alpha[0]
else:
@@ -439,11 +438,10 @@
y_mean = K_trans @ self.alpha_
# Conditionally undo the normalization
if self.normalize_y:
y_mean = self._y_train_std * y_mean + self._y_train_mean
-
# if y_mean has shape (n_samples, 1), reshape to (n_samples,)
if y_mean.ndim > 1 and y_mean.shape[1] == 1:
y_mean = np.squeeze(y_mean, axis=1)
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/_gpr.py
--- /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/tests/test_gpr.py 2024-09-27 18:12:20.893177+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/tests/test_gpr.py 2024-09-27 18:12:36.105382+00:00
@@ -341,28 +341,28 @@
# made by GPy.
assert_allclose(y_pred_std, y_pred_std_gpy, rtol=0.15, atol=0)
def test_gaussian_process_regressor_std_scaling():
- # Test that the standard deviations predicted by GaussianProcessRegressor scale correctly
+ # Test that the standard deviations predicted by GaussianProcessRegressor scale correctly
# between multiple outputs, both with and without output normalization.
- x = np.array([[0.], [0.5], [1.]])
- y = np.hstack((x**2, 10*x**2))
-
+ x = np.array([[0.0], [0.5], [1.0]])
+ y = np.hstack((x**2, 10 * x**2))
+
# With output normalization
gpr = GaussianProcessRegressor(normalize_y=True)
gpr.fit(x, y)
std = gpr.predict(np.array([[0.25]]), return_std=True)[1][0]
- assert_almost_equal(std[0], std[1]/10, decimal=9)
+ assert_almost_equal(std[0], std[1] / 10, decimal=9)
# Without output normalization
gpr = GaussianProcessRegressor(normalize_y=False)
gpr.fit(x, y)
std = gpr.predict(np.array([[0.25]]), return_std=True)[1][0]
- assert_almost_equal(std[0], std[1]/10, decimal=9)
+ assert_almost_equal(std[0], std[1] / 10, decimal=9)
def test_y_multioutput():
# Test that GPR can deal with multi-dimensional target values
y_2d = np.vstack((y, y * 2)).T
@@ -370,14 +370,18 @@
# Test for fixed kernel that first dimension of 2d GP equals the output
# of 1d GP and that second dimension is twice as large
kernel = RBF(length_scale=1.0)
for normalize_y in [False, True]:
- gpr = GaussianProcessRegressor(kernel=kernel, optimizer=None, normalize_y=normalize_y)
+ gpr = GaussianProcessRegressor(
+ kernel=kernel, optimizer=None, normalize_y=normalize_y
+ )
gpr.fit(X, y)
- gpr_2d = GaussianProcessRegressor(kernel=kernel, optimizer=None, normalize_y=normalize_y)
+ gpr_2d = GaussianProcessRegressor(
+ kernel=kernel, optimizer=None, normalize_y=normalize_y
+ )
gpr_2d.fit(X, y_2d)
y_pred_1d, y_std_1d = gpr.predict(X2, return_std=True)
y_pred_2d, y_std_2d = gpr_2d.predict(X2, return_std=True)
_, y_cov_1d = gpr.predict(X2, return_cov=True)
@@ -419,11 +423,10 @@
gpr_2d = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr_2d.fit(X, np.vstack((y, y)).T)
assert_almost_equal(gpr.kernel_.theta, gpr_2d.kernel_.theta, 4)
-
@pytest.mark.parametrize("kernel", non_fixed_kernels)
def test_custom_optimizer(kernel):
# Test that GPR can use externally defined optimizers.
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/tests/test_gpr.py
Oh no! 💥 💔 💥
4 files would be reformatted, 924 files would be left unchanged.
ruff
ruff detected issues. Please run `ruff check --fix --output-format=full .` locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is `ruff=0.5.1`.
examples/gaussian_process/plot_compare_gpr_krr.py:397:11: W292 [*] No newline at end of file
|
395 | # are converging towards their mean and their standard deviation
396 | # also increases.
397 | plt.show()
| W292
|
= help: Add trailing newline
sklearn/gaussian_process/_gpr.py:444:1: W293 [*] Blank line contains whitespace
|
442 | if self.normalize_y:
443 | y_mean = self._y_train_std * y_mean + self._y_train_mean
444 |
| ^^^^^^^^^^^^ W293
445 |
446 | # if y_mean has shape (n_samples, 1), reshape to (n_samples,)
|
= help: Remove whitespace from blank line
sklearn/gaussian_process/tests/test_gpr.py:346:89: E501 Line too long (94 > 88)
|
345 | def test_gaussian_process_regressor_std_scaling():
346 | # Test that the standard deviations predicted by GaussianProcessRegressor scale correctly
| ^^^^^^ E501
347 | # between multiple outputs, both with and without output normalization.
348 | x = np.array([[0.], [0.5], [1.]])
|
sklearn/gaussian_process/tests/test_gpr.py:346:94: W291 [*] Trailing whitespace
|
345 | def test_gaussian_process_regressor_std_scaling():
346 | # Test that the standard deviations predicted by GaussianProcessRegressor scale correctly
| ^ W291
347 | # between multiple outputs, both with and without output normalization.
348 | x = np.array([[0.], [0.5], [1.]])
|
= help: Remove trailing whitespace
sklearn/gaussian_process/tests/test_gpr.py:350:1: W293 [*] Blank line contains whitespace
|
348 | x = np.array([[0.], [0.5], [1.]])
349 | y = np.hstack((x**2, 10*x**2))
350 |
| ^^^^ W293
351 | # With output normalization
352 | gpr = GaussianProcessRegressor(normalize_y=True)
|
= help: Remove whitespace from blank line
sklearn/gaussian_process/tests/test_gpr.py:375:89: E501 Line too long (94 > 88)
|
374 | for normalize_y in [False, True]:
375 | gpr = GaussianProcessRegressor(kernel=kernel, optimizer=None, normalize_y=normalize_y)
| ^^^^^^ E501
376 | gpr.fit(X, y)
|
sklearn/gaussian_process/tests/test_gpr.py:378:89: E501 Line too long (97 > 88)
|
376 | gpr.fit(X, y)
377 |
378 | gpr_2d = GaussianProcessRegressor(kernel=kernel, optimizer=None, normalize_y=normalize_y)
| ^^^^^^^^^ E501
379 | gpr_2d.fit(X, y_2d)
|
Found 7 errors.
[*] 4 fixable with the `--fix` option.
Sorry for the late reply. Could you please update your branch and fix the linting issues?
Can this PR be merged in the next release? If not, I can make a new PR in the coming days.
This actually seems to be AI-generated, and the OP is not responding. Feel free to open a new PR @mdelozzo . Also note our policy regarding LLM-generated content.