handson-ml2 Linear regression example in 2nd Edition book using unprocessed training data

It appears that the code on page 75 of the 2nd edition of "Hands-on..." tests the trained linear regression model on the unprocessed housing DataFrame. If the model was trained on housing_prepared, shouldn't the examples (i.e. some_data=housing.iloc[:5]) use the processed data set as well (i.e. some_data=housing_prepared[:5])?

Aug 23 '19 00:08 jsukup

Hi @jsukup , thanks for your question.

Are you referring to this code example?

>>> some_data = housing.iloc[:5]
>>> some_labels = housing_labels.iloc[:5]
>>> some_data_prepared = full_pipeline.transform(some_data)
>>> print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [ 210644.6045  317768.8069  210956.4333  59218.9888  189747.5584]
>>> print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

If so, then notice that it does prepare the data (full_pipeline.transform(some_data)) before it uses the trained model to make predictions (lin_reg.predict(some_data_prepared)).

Hope this helps, Aurélien

Aug 28 '19 07:08 ageron

@ageron Hi! Testing on my own laptop, some_data_prepared (after full_pipeline.transform(some_data)) only contains three different categories, which doesn't match what the linear model expects.

Feb 12 '20 09:02 huang-jl

Hi @huang-jl ,

I can see only two explanations:

Perhaps your full_pipeline was fitted on a part of the dataset that only contained three different categories. Instead, the pipeline should be fitted on the full training set (as in the book and the notebook), like in this cell:

from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

Perhaps you are calling full_pipeline.fit_transform(some_data) instead of full_pipeline.transform(some_data)? If so, then just replace fit_transform() with transform(): we're only supposed to fit the pipeline on the training set, not refit it on the few rows we want predictions for.
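A minimal sketch of this failure mode, using a toy category column rather than the housing data (the values and variable names below are assumptions for illustration only). It shows that calling transform() on five rows with an encoder fitted on the full column keeps all five one-hot columns, while calling fit_transform() on the same five rows refits the encoder and yields only three:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the ocean_proximity column (hypothetical values).
# The full column has 5 distinct categories; the first 5 rows only have 3.
train_cats = ["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR BAY",
              "<1H OCEAN", "NEAR OCEAN", "ISLAND", "INLAND"]
X_train = np.array(train_cats, dtype=object).reshape(-1, 1)

# Fit the encoder once, on the full training column.
encoder = OneHotEncoder()
encoder.fit(X_train)

some_data = X_train[:5]

# Correct: reuse the fitted encoder, so all 5 category columns are kept.
prepared_ok = encoder.transform(some_data)

# Incorrect: fit_transform() refits on the 5 rows and only sees 3 categories.
prepared_refit = OneHotEncoder().fit_transform(some_data)

print("transform():    ", prepared_ok.shape)
print("fit_transform():", prepared_refit.shape)
```

The same reasoning applies to a ColumnTransformer such as full_pipeline: refitting it on a small slice changes the number of output columns, so they no longer line up with what the model was trained on.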

Hope this helps.

Mar 31 '20 01:03 ageron

I also ran into the same problem: some_data_prepared only had 3 categories instead of 5 when I first executed predict(some_data_prepared).

full_pipeline.named_transformers_['cat'].categories_ lists only 3 categories.

However, after I ran the cell mentioned above again, the issue was resolved without any code change and OneHotEncoder now learns that there are 5 categories and the predict works.

This is super weird though... maybe an internal bug in sklearn.
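One way to narrow this down, sketched here on toy data (the column values, labels, and the helper name feature_report are assumptions, not code from the book): compare the categories the encoder has learned and the number of prepared columns against the number of coefficients in the trained model. If the encoder was last fitted on a small slice, the counts will disagree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Toy category column and labels (hypothetical values, not the housing data).
cats = np.array(["NEAR BAY", "<1H OCEAN", "INLAND", "NEAR BAY",
                 "<1H OCEAN", "NEAR OCEAN", "ISLAND", "INLAND"],
                dtype=object).reshape(-1, 1)
labels = np.array([3.0, 2.5, 1.0, 3.2, 2.4, 2.8, 4.0, 1.1])

# Encoder and model fitted on the full toy training set.
encoder = OneHotEncoder()
model = LinearRegression().fit(encoder.fit_transform(cats), labels)

def feature_report(enc, fitted_model, prepared):
    """Hypothetical helper: compare encoder output width with the model's."""
    n_expected = fitted_model.coef_.shape[0]
    return {
        "n_categories": len(enc.categories_[0]),
        "n_prepared_columns": prepared.shape[1],
        "n_model_features": n_expected,
        "matches": prepared.shape[1] == n_expected,
    }

# Encoder fitted on the full set: the counts agree.
good = feature_report(encoder, model, encoder.transform(cats[:5]))

# Encoder refitted on only the first 5 rows: the counts disagree.
small_encoder = OneHotEncoder().fit(cats[:5])
bad = feature_report(small_encoder, model, small_encoder.transform(cats[:5]))

print(good)
print(bad)
```

In the notebook setting, the analogous checks would be full_pipeline.named_transformers_['cat'].categories_ and some_data_prepared.shape against lin_reg.coef_.shape.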

Sep 12 '20 22:09 qingchuanzhu

[Attached image: 20201024_105137]

I'm also having this same problem just before this code.

Oct 25 '20 13:10 Aliiiu

Hi, on page 75 of the second edition of the book, I am having a problem loading the dataset after writing the code for downloading it.

Jul 14 '22 18:07 Jeremiah004
