Global minimum Chapter4. Page(119)
Hi @ageron,
- So in chapter 2 we used the sklearn linear regressor model and calculated the RMSE. In chapter 4 we learn about the normal equation and gradient descent, which help us optimize the cost function. My question is: if I were using the linear regressor as a black box, how would I know that the RMSE I have is the global minimum? How would I judge how well the model worked just from looking at the RMSE?
- In chapter 4 you talk about using grid search to find the appropriate learning rate. I picture grid search as something you use on an algorithm with tunable hyperparameters. So how would I do this, given that the linear regressor algorithm, for example, has no learning rate hyperparameter? Could you please show some code that does the grid search and sets a stopping tolerance? Thanks :)
Hi @FritzPeleke, thanks for your questions.
- The good thing about the LinearRegression class is that it uses an analytical approach to find the optimal parameters, so it is mathematically guaranteed to optimally fit the data you give it (it minimizes the MSE on the training set). However, that does not mean the model will generalize well to new examples. If you measure the model's performance on the validation set, you can use the RMSE, as it is more easily interpretable than the MSE, since it has the same scale as the values you are looking at. But how do you know whether the RMSE is good or not? In short: there's no universal way. For example, if you are estimating the price of 500k$ houses and the RMSE is 50k$, as a first approximation, you can think of this as the size of the average error of the model: it estimates house values ±50k$ (in reality, it's not the mean absolute error, since the RMSE weighs large errors more than small errors, but it still gives you a rough idea of the mean error). There's no universal "good" value for the RMSE, it really depends on the task at hand. Perhaps in some cases ±50k$ is great, and in others it's horrible. So usually you will compare the RMSE with known baselines, such as existing systems, or human experts.
- You are right that using grid search to tweak the learning rate would not make sense if we were using the LinearRegression class, since it does not have a learning_rate hyperparameter. However, when talking about learning rates in chapter 4 (around figure 4-20), the book performs linear regression using the SGDRegressor class, not the LinearRegression class. The difference is that the SGDRegressor class does not use a closed-form solution like LinearRegression does. Instead, it uses stochastic gradient descent (hence the name) and therefore it has a learning rate argument that you can tweak. You can use GridSearchCV to find the best learning rate, just like in this notebook: https://github.com/ageron/handson-ml2/blob/master/02_end_to_end_machine_learning_project.ipynb
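As a rough illustration of the grid search described above, here is a minimal sketch that tunes the initial learning rate (eta0) and the stopping tolerance (tol) of an SGDRegressor. The synthetic dataset, the candidate values in the grid, and the choice of a constant learning rate schedule are all assumptions made for the example, not something taken from the book.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic linear data (illustrative only): y = 4 + 3x + Gaussian noise
rng = np.random.default_rng(42)
X = 2 * rng.random((200, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=200)

# Scale the features first, since gradient descent is sensitive to scale
pipeline = make_pipeline(
    StandardScaler(),
    SGDRegressor(learning_rate="constant", max_iter=1000, random_state=42),
)

# Candidate learning rates and stopping tolerances (arbitrary choices)
param_grid = {
    "sgdregressor__eta0": [0.0001, 0.001, 0.01, 0.1],
    "sgdregressor__tol": [1e-2, 1e-3, 1e-4],
}

# 5-fold cross-validation; the scorer is negated so that higher is better
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5, scoring="neg_root_mean_squared_error"
)
grid_search.fit(X, y)

best_rmse = -grid_search.best_score_  # flip the sign back to a plain RMSE
print(grid_search.best_params_, best_rmse)
```

The cross-validated RMSE printed at the end is the number you would compare against a baseline, as discussed in the first point.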
Hope this helps.
That helps, thanks :). I have another question. In the book, we do a basic implementation of early stopping with the SGD regressor. Is it possible to use just the hyperparameters of the SGD class, like 'early_stopping', 'tol' and 'validation_fraction', to stop the algorithm without using a for loop as we did? And how do you know which maximum number of epochs to use (the book, for example, uses "for epoch in range(1000)")?
My pleasure! :)
Yes, you should definitely use these options instead of rolling out your own implementation like I did (I wrote the code before Scikit-Learn 0.20, which introduced these options). Hopefully the for loop makes it clear what goes on under the hood when you use these options.
The maximum epoch usually does not matter much, as long as it's large enough, because early stopping will eventually interrupt training. Unless you set the learning rate too low (making training really too slow), there's not much reason to stop training as long as the validation error keeps going down.
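For reference, a minimal sketch of what relying on the built-in options might look like. The dataset, the degree-2 polynomial features and the specific values of max_iter, tol, validation_fraction and n_iter_no_change are illustrative assumptions, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic quadratic data (illustrative only)
rng = np.random.default_rng(42)
X = 6 * rng.random((200, 1)) - 3
y = 2 + X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    SGDRegressor(
        max_iter=10_000,          # generous upper bound on the number of epochs
        early_stopping=True,      # hold out part of the training set internally
        validation_fraction=0.2,  # share of the training data used for validation
        n_iter_no_change=10,      # patience, in epochs
        tol=1e-4,                 # minimum validation-score improvement that counts
        random_state=42,
    ),
)
model.fit(X_train, y_train)

sgd = model[-1]  # the fitted SGDRegressor step
test_rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(sgd.n_iter_, test_rmse)
```

Note that with early_stopping=True the validation set is carved out of the data passed to fit(), so a separate test set is still needed for the final evaluation.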
Hope this helps.
Hey @ageron, I'm back again :)
So I tried using the hyperparameters rather than the for loop, but I get different results. I don't know if I'm using the parameters wrongly. Below is the code for both the for loop method and my version with hyperparameters. Please take a look.
# Early stopping with a manual loop
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(42)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 2 + X + 0.5 * X**2 + np.random.randn(m, 1)
X_train, X_val, y_train, y_val = train_test_split(X, y.ravel(), test_size=0.2, random_state=10)

# Expand to degree-90 polynomial features, then standardize them
poly_scaler = Pipeline([('poly feat', PolynomialFeatures(degree=90, include_bias=False)),
                        ('scaler', StandardScaler())])
x_train_poly_scaled = poly_scaler.fit_transform(X_train)
x_val_poly_scaled = poly_scaler.transform(X_val)

# One epoch per fit() call; tol=-np.inf turns off the built-in stopping check
# (recent Scikit-Learn versions expect tol=None instead)
sgd_reg = SGDRegressor(max_iter=1, tol=-np.inf, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005, random_state=42)
minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(x_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(x_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = clone(sgd_reg)  # copies the hyperparameters only, not the fitted weights
print(best_epoch, minimum_val_error, best_model)

# Early stopping using hyperparameters
s_reg = SGDRegressor(penalty=None, max_iter=1000, tol=-np.inf, random_state=42,
                     learning_rate="constant", eta0=0.0005, warm_start=True,
                     early_stopping=True, validation_fraction=0.2, n_iter_no_change=3)
s_reg.fit(x_train_poly_scaled, y_train)
pred = s_reg.predict(x_val_poly_scaled)
print(s_reg.n_iter_, mean_squared_error(y_val, pred))
Do you get very different results, or just slightly different (like changing the random seed)?
Here are the results. I expected the number of iterations needed to reach the minimum to be the same, but I get 116 with the for loop method, while using the parameters just gives me the max iterations, which is 1000. I also printed the MSE for both, and they are different too, as shown below.
C:\Users\fritz\Hello\Scripts\python.exe C:/Users/fritz/PycharmProjects/Hello/Chap-4.py
116 MSE =1.1582514122283896 SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1, eta0=0.0005, fit_intercept=True, l1_ratio=0.15, learning_rate='constant', loss='squared_loss', max_iter=1, n_iter_no_change=5, penalty=None, power_t=0.25, random_state=42, shuffle=True, tol=-inf, validation_fraction=0.1, verbose=0, warm_start=True)
1000 MSE = 5.1558101064475024e+20
Process finished with exit code 0
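One thing that may be worth checking here, offered as an assumption rather than a confirmed diagnosis: the built-in early stopping is understood to count an epoch as "no improvement" only when a finite tol is set, so disabling the tolerance (tol=-np.inf in older versions, tol=None in recent ones) would let training run all the way to max_iter even with early_stopping=True. The sketch below compares the two settings on a small synthetic dataset. The data and every hyperparameter value are illustrative choices, and the behavior should be verified against your installed Scikit-Learn version.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Small, well-conditioned synthetic regression problem (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

# Settings shared by both models; only tol differs
common = dict(
    learning_rate="constant", eta0=0.01, max_iter=50,
    early_stopping=True, validation_fraction=0.2,
    n_iter_no_change=3, random_state=42,
)

# Tolerance disabled: the improvement check never triggers
no_tol = SGDRegressor(tol=None, **common).fit(X, y)

# Finite tolerance: training can end once the validation score stalls
with_tol = SGDRegressor(tol=1e-3, **common).fit(X, y)

print(no_tol.n_iter_, with_tol.n_iter_)
```

If the first number equals max_iter while the second is smaller, the tolerance setting is a likely reason the two approaches in the code above stop at different epochs.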