Fix normalization bug in GaussianProcessRegressor
Fixes #29697
Bug Fix:
- Normalization: Corrected the normalization logic in the `fit` method to ensure consistent handling of the `_y_train_mean` and `_y_train_std` variables, regardless of whether `normalize_y` is `True` or `False`.
- Modified the prediction logic to apply the inverse transformation only if `normalize_y` is `True`, ensuring predictions are correctly scaled.
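The fit/predict symmetry described above can be sketched in isolation. This is a minimal illustration of the intended contract, not scikit-learn's actual implementation; `TinyNormalizer` is a hypothetical helper introduced only to show that the inverse transform must be applied exactly when (and only when) normalization was applied in `fit`.

```python
import numpy as np


class TinyNormalizer:
    """Sketch of the normalize_y contract (hypothetical, not sklearn code):
    subtract the mean and divide by the std in fit, and undo that
    transformation in predict only when normalize_y is True."""

    def __init__(self, normalize_y=False):
        self.normalize_y = normalize_y

    def fit(self, y):
        y = np.asarray(y, dtype=float)
        if self.normalize_y:
            self._y_train_mean = y.mean(axis=0)
            self._y_train_std = y.std(axis=0)
            y = (y - self._y_train_mean) / self._y_train_std
        else:
            # Identity transform: mean 0, std 1, so no rescaling occurs
            n_outputs = y.shape[1] if y.ndim > 1 else 1
            self._y_train_mean = np.zeros(n_outputs)
            self._y_train_std = np.ones(n_outputs)
        self._y_train_ = y
        return self

    def predict(self, y_mean):
        # Undo the normalization only if it was applied during fit
        if self.normalize_y:
            y_mean = self._y_train_std * y_mean + self._y_train_mean
        return y_mean
```

With `normalize_y=True` the round trip recovers the original targets; with `normalize_y=False` predictions pass through unchanged, which is the behavior the fix enforces.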
Test Suite Correction:
- Updated `test_y_multioutput` to correctly handle cases where the second output's standard deviation should scale differently when `normalize_y=False`. This change aligns with the expected behavior after the bug fix.
- Added assertions for covariance values, which were calculated but not previously verified.
- Introduced a new test for `normalize_y=True` to enhance the robustness of the test suite.
- Incorporated a test inspired by the issue report, ensuring comprehensive coverage.
❌ Linting issues
This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.
You can see the details of the linting issues under the lint job here
black
black detected issues. Please run `black .` locally and push the changes. Here you can see the detected issues. Note that running black might also fix some of the issues which might be detected by ruff. Note that the installed black version is `black=24.3.0`.
--- /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_compare_gpr_krr.py 2024-09-27 18:12:20.854176+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_compare_gpr_krr.py 2024-09-27 18:12:29.269175+00:00
@@ -392,6 +392,6 @@
# The effect of using a radial basis function kernel will attenuate the
# periodicity effect once that no sample are available in the training.
# As testing samples get further away from the training ones, predictions
# are converging towards their mean and their standard deviation
# also increases.
-plt.show()
\ No newline at end of file
+plt.show()
would reformat /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_compare_gpr_krr.py
--- /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_gpr_prior_posterior.py 2024-09-27 18:12:20.854176+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_gpr_prior_posterior.py 2024-09-27 18:12:29.355953+00:00
@@ -255,6 +255,5 @@
f"Kernel parameters after fit: \n{gpr.kernel_} \n"
f"Log-likelihood: {gpr.log_marginal_likelihood(gpr.kernel_.theta):.3f}"
)
plt.show()
-
would reformat /home/runner/work/scikit-learn/scikit-learn/examples/gaussian_process/plot_gpr_prior_posterior.py
--- /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/_gpr.py 2024-09-27 18:12:20.892177+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/_gpr.py 2024-09-27 18:12:35.610865+00:00
@@ -272,11 +272,10 @@
# Normalize target value only if normalize_y is True
if self.normalize_y:
y = (y - self._y_train_mean) / self._y_train_std
else:
self._y_train_mean = np.zeros_like(self._y_train_mean)
-
if np.iterable(self.alpha) and self.alpha.shape[0] != y.shape[0]:
if self.alpha.shape[0] == 1:
self.alpha = self.alpha[0]
else:
@@ -439,11 +438,10 @@
y_mean = K_trans @ self.alpha_
# Conditionally undo the normalization
if self.normalize_y:
y_mean = self._y_train_std * y_mean + self._y_train_mean
-
# if y_mean has shape (n_samples, 1), reshape to (n_samples,)
if y_mean.ndim > 1 and y_mean.shape[1] == 1:
y_mean = np.squeeze(y_mean, axis=1)
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/_gpr.py
--- /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/tests/test_gpr.py 2024-09-27 18:12:20.893177+00:00
+++ /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/tests/test_gpr.py 2024-09-27 18:12:36.105382+00:00
@@ -341,28 +341,28 @@
# made by GPy.
assert_allclose(y_pred_std, y_pred_std_gpy, rtol=0.15, atol=0)
def test_gaussian_process_regressor_std_scaling():
- # Test that the standard deviations predicted by GaussianProcessRegressor scale correctly
+ # Test that the standard deviations predicted by GaussianProcessRegressor scale correctly
# between multiple outputs, both with and without output normalization.
- x = np.array([[0.], [0.5], [1.]])
- y = np.hstack((x**2, 10*x**2))
-
+ x = np.array([[0.0], [0.5], [1.0]])
+ y = np.hstack((x**2, 10 * x**2))
+
# With output normalization
gpr = GaussianProcessRegressor(normalize_y=True)
gpr.fit(x, y)
std = gpr.predict(np.array([[0.25]]), return_std=True)[1][0]
- assert_almost_equal(std[0], std[1]/10, decimal=9)
+ assert_almost_equal(std[0], std[1] / 10, decimal=9)
# Without output normalization
gpr = GaussianProcessRegressor(normalize_y=False)
gpr.fit(x, y)
std = gpr.predict(np.array([[0.25]]), return_std=True)[1][0]
- assert_almost_equal(std[0], std[1]/10, decimal=9)
+ assert_almost_equal(std[0], std[1] / 10, decimal=9)
def test_y_multioutput():
# Test that GPR can deal with multi-dimensional target values
y_2d = np.vstack((y, y * 2)).T
@@ -370,14 +370,18 @@
# Test for fixed kernel that first dimension of 2d GP equals the output
# of 1d GP and that second dimension is twice as large
kernel = RBF(length_scale=1.0)
for normalize_y in [False, True]:
- gpr = GaussianProcessRegressor(kernel=kernel, optimizer=None, normalize_y=normalize_y)
+ gpr = GaussianProcessRegressor(
+ kernel=kernel, optimizer=None, normalize_y=normalize_y
+ )
gpr.fit(X, y)
- gpr_2d = GaussianProcessRegressor(kernel=kernel, optimizer=None, normalize_y=normalize_y)
+ gpr_2d = GaussianProcessRegressor(
+ kernel=kernel, optimizer=None, normalize_y=normalize_y
+ )
gpr_2d.fit(X, y_2d)
y_pred_1d, y_std_1d = gpr.predict(X2, return_std=True)
y_pred_2d, y_std_2d = gpr_2d.predict(X2, return_std=True)
_, y_cov_1d = gpr.predict(X2, return_cov=True)
@@ -419,11 +423,10 @@
gpr_2d = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr_2d.fit(X, np.vstack((y, y)).T)
assert_almost_equal(gpr.kernel_.theta, gpr_2d.kernel_.theta, 4)
-
@pytest.mark.parametrize("kernel", non_fixed_kernels)
def test_custom_optimizer(kernel):
# Test that GPR can use externally defined optimizers.
would reformat /home/runner/work/scikit-learn/scikit-learn/sklearn/gaussian_process/tests/test_gpr.py
Oh no! 💥 💔 💥
4 files would be reformatted, 924 files would be left unchanged.
ruff
ruff detected issues. Please run `ruff check --fix --output-format=full .` locally, fix the remaining issues, and push the changes. Here you can see the detected issues. Note that the installed ruff version is `ruff=0.5.1`.
examples/gaussian_process/plot_compare_gpr_krr.py:397:11: W292 [*] No newline at end of file
|
395 | # are converging towards their mean and their standard deviation
396 | # also increases.
397 | plt.show()
| W292
|
= help: Add trailing newline
sklearn/gaussian_process/_gpr.py:444:1: W293 [*] Blank line contains whitespace
|
442 | if self.normalize_y:
443 | y_mean = self._y_train_std * y_mean + self._y_train_mean
444 |
| ^^^^^^^^^^^^ W293
445 |
446 | # if y_mean has shape (n_samples, 1), reshape to (n_samples,)
|
= help: Remove whitespace from blank line
sklearn/gaussian_process/tests/test_gpr.py:346:89: E501 Line too long (94 > 88)
|
345 | def test_gaussian_process_regressor_std_scaling():
346 | # Test that the standard deviations predicted by GaussianProcessRegressor scale correctly
| ^^^^^^ E501
347 | # between multiple outputs, both with and without output normalization.
348 | x = np.array([[0.], [0.5], [1.]])
|
sklearn/gaussian_process/tests/test_gpr.py:346:94: W291 [*] Trailing whitespace
|
345 | def test_gaussian_process_regressor_std_scaling():
346 | # Test that the standard deviations predicted by GaussianProcessRegressor scale correctly
| ^ W291
347 | # between multiple outputs, both with and without output normalization.
348 | x = np.array([[0.], [0.5], [1.]])
|
= help: Remove trailing whitespace
sklearn/gaussian_process/tests/test_gpr.py:350:1: W293 [*] Blank line contains whitespace
|
348 | x = np.array([[0.], [0.5], [1.]])
349 | y = np.hstack((x**2, 10*x**2))
350 |
| ^^^^ W293
351 | # With output normalization
352 | gpr = GaussianProcessRegressor(normalize_y=True)
|
= help: Remove whitespace from blank line
sklearn/gaussian_process/tests/test_gpr.py:375:89: E501 Line too long (94 > 88)
|
374 | for normalize_y in [False, True]:
375 | gpr = GaussianProcessRegressor(kernel=kernel, optimizer=None, normalize_y=normalize_y)
| ^^^^^^ E501
376 | gpr.fit(X, y)
|
sklearn/gaussian_process/tests/test_gpr.py:378:89: E501 Line too long (97 > 88)
|
376 | gpr.fit(X, y)
377 |
378 | gpr_2d = GaussianProcessRegressor(kernel=kernel, optimizer=None, normalize_y=normalize_y)
| ^^^^^^^^^ E501
379 | gpr_2d.fit(X, y_2d)
|
Found 7 errors.
[*] 4 fixable with the `--fix` option.
Sorry for the late reply. Could you please update your branch and fix the linting issues?
Can this PR be merged in the next release? If not, I can make a new PR in the coming days.
This actually seems to be AI-generated, and the OP is not responding. Feel free to open a new PR @mdelozzo . Also note our policy regarding LLM-generated content.