handson-ml2
Linear regression example in 2nd Edition book using unprocessed training data
It appears that the data used to test the trained linear regression model on page 75 of the 2nd edition of "Hands-on..." is the unprocessed housing data frame. If the model was trained with housing_prepared, shouldn't the examples (i.e. some_data=housing.iloc[:5]) use the processed data set as well (i.e. some_data=housing_prepared[:5])?
Hi @jsukup, thanks for your question.
Are you referring to this code example?
>>> some_data = housing.iloc[:5]
>>> some_labels = housing_labels.iloc[:5]
>>> some_data_prepared = full_pipeline.transform(some_data)
>>> print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [ 210644.6045 317768.8069 210956.4333 59218.9888 189747.5584]
>>> print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
If so, then notice that it does prepare the data (full_pipeline.transform(some_data)) before it uses the trained model to make predictions (lin_reg.predict(some_data_prepared)).
Hope this helps, Aurélien
@ageron Hi!
Testing on my own laptop, some_data_prepared (after full_pipeline.transform(some_data)) only contains three different categories, which doesn't match the input the linear model was trained on.
Hi @huang-jl,
I can see only two explanations:
- Perhaps your full_pipeline was fitted on a part of the dataset that only contained three different categories. Instead, the pipeline should be fitted on the full training set (as in the book and the notebook), like in this cell:
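from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# housing_num and num_pipeline are defined in earlier notebook cells
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
# Fit on the full training set so the encoder sees every category
housing_prepared = full_pipeline.fit_transform(housing)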
- Perhaps you are calling full_pipeline.fit_transform(some_data) instead of full_pipeline.transform(some_data)? If so, then just replace fit_transform() with transform(): we're only supposed to fit the pipeline on the training set.
Hope this helps.
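The difference between the two calls described above can be reproduced on a small toy dataset. This is only a sketch: the data frame values are made up, StandardScaler stands in for the book's num_pipeline, and build_pipeline is a hypothetical helper introduced here for illustration. It shows that calling fit_transform() on a five-row-style subset makes the encoder learn only the categories present in that subset, so the output has fewer columns than the model was trained on.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing training set (values are made up).
train = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 2.1],
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN",
                        "NEAR OCEAN", "ISLAND", "INLAND"],
})

def build_pipeline():
    # Hypothetical helper: StandardScaler replaces the book's num_pipeline.
    return ColumnTransformer([
        ("num", StandardScaler(), ["median_income"]),
        ("cat", OneHotEncoder(), ["ocean_proximity"]),
    ])

# The first three rows contain only three distinct categories.
some_data = train.iloc[:3]

# Correct: fit once on the full training set, then only transform the subset.
pipeline = build_pipeline()
train_prepared = pipeline.fit_transform(train)
some_prepared = pipeline.transform(some_data)

# Mistake: fitting on the subset makes the encoder forget unseen categories.
refit = build_pipeline()
some_refit = refit.fit_transform(some_data)

print("columns after fit on full set:", train_prepared.shape[1])
print("columns after transform():    ", some_prepared.shape[1])
print("columns after fit_transform():", some_refit.shape[1])
```

With transform(), the subset keeps one numeric column plus one column per category learned from the full training set. With fit_transform(), it only gets columns for the categories it happens to contain.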
I also ran into the same problem: some_data_prepared only has 3 categories instead of 5 when I first execute predict(some_data_prepared), and full_pipeline.named_transformers_['cat'].categories_ lists only 3 categories.
However, after I ran the cell mentioned above again, the issue was resolved without any code change: OneHotEncoder now learns that there are 5 categories and predict works.
This is super weird though... maybe an internal bug in sklearn.
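One plausible explanation, which is an assumption and not confirmed in this thread, is notebook state: if some cell re-fitted full_pipeline on a subset, re-running the cell that fits it on the full training set would restore all categories. A quick sanity check along the following lines may help tell the two situations apart. The toy data, the labels, and the use of StandardScaler in place of the book's num_pipeline are all made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the housing training set and labels (values are made up).
housing = pd.DataFrame({
    "median_income": [8.3, 7.2, 5.6, 3.8, 4.0, 2.1],
    "ocean_proximity": ["NEAR BAY", "INLAND", "<1H OCEAN",
                        "NEAR OCEAN", "ISLAND", "INLAND"],
})
housing_labels = np.array([452600.0, 358500.0, 341300.0,
                           226700.0, 287500.0, 140000.0])

full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["median_income"]),  # stand-in for num_pipeline
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])
housing_prepared = full_pipeline.fit_transform(housing)
lin_reg = LinearRegression().fit(housing_prepared, housing_labels)

# Check 1: the encoder should know every category in the training set.
learned = full_pipeline.named_transformers_["cat"].categories_[0]
expected = housing["ocean_proximity"].nunique()
print("categories learned:", len(learned), "expected:", expected)

# Check 2: transformed data must have as many columns as the model has weights.
some_data_prepared = full_pipeline.transform(housing.iloc[:3])
n_model_features = lin_reg.coef_.shape[0]
print("columns:", some_data_prepared.shape[1], "model features:", n_model_features)

predictions = lin_reg.predict(some_data_prepared)
```

If the first check reports fewer categories than the training set contains, the pipeline was most likely fitted on a subset at some point, and re-running the fitting cell on the full training set should fix it.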
I'm also having this same problem just before this code.
Hi, on page 75 of the second edition of the book, I am having a problem loading the dataset after writing the code for downloading it.