handson-ml Global minimum Chapter4. Page(119)

Hi @ageron,

In chapter 2 we used the Scikit-Learn LinearRegression model and calculated the RMSE. In chapter 4 we learn about the Normal Equation and Gradient Descent, which help us minimize the cost function. My question is: if I were using LinearRegression as a black box, how would I know that the RMSE I get is the global minimum? And how can I judge how well the model worked just from looking at the RMSE?
In chapter 4 you talk about using grid search to find an appropriate learning rate. I picture grid search as something you apply to an algorithm with tunable hyperparameters. So how would I do this, given that the LinearRegression algorithm, for example, has no learning rate hyperparameter? Could you please show some code that does the grid search and sets a stopping tolerance? Thanks :)

Aug 28 '19 18:08 FritzPeleke

Hi @FritzPeleke, thanks for your questions.

The good thing about the LinearRegression class is that it uses an analytical approach to find the optimal parameters, so it is mathematically guaranteed to fit the data you give it optimally (it reaches the global minimum of the MSE on the training set). However, that does not mean the model will generalize well to new examples. If you measure the model's performance on the validation set, you can use the RMSE, as it is more easily interpretable than the MSE, since it has roughly the same scale as the values you are looking at. But how do you know whether the RMSE is good or not? In short: there's no universal way. So, for example, if you are estimating the price of $500k houses and the RMSE is $50k, as a first approximation, you can think of this as the size of the average error of the model: it estimates house values ±$50k (in reality, it's not the mean absolute error, since the RMSE weighs large errors more than small errors, but it still gives you a rough idea of the mean error). There's no universal "good" value for the RMSE, it really depends on the task at hand. Perhaps in some cases ±$50k is great, and in others it's horrible. So usually you will compare the RMSE with known baselines, such as existing systems, or human experts.
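To make this concrete, here is a minimal sketch on made-up data (the dataset, the variable names and the use of DummyRegressor as a stand-in baseline are all assumptions for illustration, not something from the book). It checks that LinearRegression matches a least-squares solution of the Normal Equation, then compares the validation RMSE with the MAE and with a trivial baseline:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic linear data (assumed for illustration): y = 4 + 3x + Gaussian noise
rng = np.random.RandomState(42)
X = 6 * rng.rand(200, 1) - 3
y = 4 + 3 * X[:, 0] + rng.randn(200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Closed-form fit: no learning rate, no iterations
lin_reg = LinearRegression().fit(X_train, y_train)

# Same problem solved directly as a least-squares system (bias column added)
X_b = np.c_[np.ones(len(X_train)), X_train]
theta, *_ = np.linalg.lstsq(X_b, y_train, rcond=None)
same_solution = np.allclose(theta, [lin_reg.intercept_, lin_reg.coef_[0]])

# A trivial baseline that always predicts the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)

def rmse(model):
    """Validation RMSE, in the same units as y."""
    return np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

lin_rmse = rmse(lin_reg)
baseline_rmse = rmse(baseline)
lin_mae = mean_absolute_error(y_val, lin_reg.predict(X_val))

print("Matches least-squares solution:", same_solution)
print("Validation RMSE:", lin_rmse, "MAE:", lin_mae)
print("Baseline RMSE:", baseline_rmse)
```

The RMSE is never smaller than the MAE on the same predictions, which is the "weighs large errors more" effect mentioned above. In a real project the baseline would be an existing system or a human expert rather than a mean predictor.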
You are right that using grid search to tweak the learning rate would not make sense if we were using the LinearRegression class, since it does not have a learning_rate hyperparameter. However, when talking about learning rates in chapter 4 (around figure 4-20), the book performs linear regression using the SGDRegressor class, not the LinearRegression class. The difference is that the SGDRegressor class does not use a closed-form solution like LinearRegression does. Instead, it uses stochastic gradient descent (hence the name) and therefore it has a learning rate argument that you can tweak. You can use GridSearchCV to find the best learning rate, just like in this notebook: https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb
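As a rough sketch of what that could look like for SGDRegressor (the synthetic data, the grid values and the pipeline step names are assumptions chosen for illustration, not taken from the notebook), the grid below searches over both the learning rate `eta0` and the stopping tolerance `tol`:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic linear data (assumed for illustration)
rng = np.random.RandomState(42)
X = 6 * rng.rand(200, 1) - 3
y = 4 + 3 * X[:, 0] + rng.randn(200)

# SGD is sensitive to feature scale, so scale inside the pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("sgd", SGDRegressor(learning_rate="constant", max_iter=1000,
                         random_state=42)),
])

# Hypothetical grid: candidate learning rates and stopping tolerances
param_grid = {
    "sgd__eta0": [0.0001, 0.001, 0.01, 0.1],
    "sgd__tol": [1e-4, 1e-3],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5,
                           scoring="neg_mean_squared_error")
grid_search.fit(X, y)

# The score is a negated MSE, so flip the sign before taking the root
best_rmse = np.sqrt(-grid_search.best_score_)
print(grid_search.best_params_, best_rmse)
```

Some grid points may emit a ConvergenceWarning if `max_iter` is reached before the tolerance is met; that is itself a hint that the learning rate is too low for the iteration budget.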

Hope this helps.

Sep 06 '19 05:09 ageron

That helps, thanks :). I have another question. In the book, we do a basic implementation of early stopping with the SGDRegressor. Is it possible to use just the hyperparameters of the SGDRegressor class, like 'early_stopping', 'tol' and 'validation_fraction', to stop the algorithm without using a for loop as we did? And how do you know which maximum number of epochs to use (in the book, for example, "for epoch in range(1000)" was used)?

Sep 07 '19 12:09 FritzPeleke

My pleasure! :)

Yes, you should definitely use these options instead of rolling your own implementation like I did (I wrote the code before Scikit-Learn 0.20, which introduced these options). Hopefully the for loop makes it clear what goes on under the hood when you use these options.

The maximum epoch usually does not matter much, as long as it's large enough, because early stopping will eventually interrupt training. Unless you set the learning rate too low (making training really too slow), there's not much reason to stop training as long as the validation error keeps going down.
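Here is a minimal sketch of the built-in options (the data, the polynomial degree and the specific values for `tol`, `n_iter_no_change` and `max_iter` are assumptions for illustration). Note that with `early_stopping=True`, SGDRegressor holds out `validation_fraction` of the data passed to `fit()` by itself, so it does not see a separate validation set you may have created. Also, as far as I understand the stopping rule, `tol` needs to be a finite value here: the check counts epochs where the validation score fails to improve by at least `tol`, and disabling `tol` disables that check.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Quadratic data similar in spirit to the chapter 4 examples (assumed here)
rng = np.random.RandomState(42)
m = 200
X = 6 * rng.rand(m, 1) - 3
y = 2 + X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.randn(m)

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(
        penalty=None,
        learning_rate="constant",
        eta0=0.0005,
        max_iter=10_000,          # generous upper bound; early stopping ends sooner
        tol=1e-4,                 # minimum improvement that counts as progress
        early_stopping=True,      # hold out part of the training data internally
        validation_fraction=0.2,  # size of that internal validation set
        n_iter_no_change=10,      # patience, in epochs
        random_state=42,
    ),
)
model.fit(X, y)

# Number of epochs actually run before training stopped
n_epochs = model.named_steps["sgdregressor"].n_iter_
print("Stopped after", n_epochs, "epochs; R^2 on the data:", model.score(X, y))
```

One difference from the for loop version: this keeps the parameters from the last epoch rather than a saved copy of the model from the best epoch, so the two approaches are not expected to give identical numbers.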

Hope this helps.

Sep 07 '19 15:09 ageron

Hey @ageron, I'm back again :) I tried using the hyperparameters rather than the for loop, but I get different results. I don't know if I'm using the parameters wrongly. Below is the code for both the for loop method and mine with hyperparameters. Please take a look.

# Early stopping with a manual for loop
from copy import deepcopy

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)

X_train, X_val, y_train, y_val = train_test_split(X, y.ravel(), test_size=0.2, random_state=10)

poly_scaler = Pipeline([('poly feat', PolynomialFeatures(degree=90, include_bias=False)),
                        ('scaler', StandardScaler())])

x_train_poly_scaled = poly_scaler.fit_transform(X_train)
x_val_poly_scaled = poly_scaler.transform(X_val)

# max_iter=1 with warm_start=True means each fit() call runs one more epoch
sgd_reg = SGDRegressor(max_iter=1, tol=-np.inf, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(x_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(x_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        # deepcopy keeps the learned weights; clone() would return an unfitted copy
        best_model = deepcopy(sgd_reg)
print(best_epoch, minimum_val_error, best_model)

#early stopping using hyperparameters
s_reg = SGDRegressor(penalty=None,max_iter=1000,tol=-np.inf,random_state=42,
                     learning_rate="constant",eta0=0.0005,warm_start=True,
                     early_stopping=True,validation_fraction=0.2,n_iter_no_change=3)
s_reg.fit(x_train_poly_scaled,y_train)
pred = s_reg.predict(x_val_poly_scaled)
print(s_reg.n_iter_,mean_squared_error(y_val,pred))

Sep 07 '19 19:09 FritzPeleke

Do you get very different results, or just slightly different (like changing the random seed)?

Sep 08 '19 13:09 ageron

Here are the results. I expected the number of iterations needed to reach the minimum to be the same, but I get 116 with the for loop method, while using the parameters just gives me the max iterations, which is 1000. I also printed the MSE for both, and they are different too, as shown below.

C:\Users\fritz\Hello\Scripts\python.exe C:/Users/fritz/PycharmProjects/Hello/Chap-4.py
116 MSE =1.1582514122283896
SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1, eta0=0.0005, fit_intercept=True, l1_ratio=0.15, learning_rate='constant', loss='squared_loss', max_iter=1, n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42, shuffle=True, tol=-inf, validation_fraction=0.1, verbose=0, warm_start=True)
1000 MSE = 5.1558101064475024e+20

Process finished with exit code 0

Sep 08 '19 13:09 FritzPeleke