Linear regression example in 2nd Edition book using unprocessed training data

Open jsukup opened this issue 5 years ago • 6 comments

It appears that the trained linear regression model on page 75 of the 2nd edition of "Hands-on..." is tested on the unprocessed housing data frame. If the model was trained with housing_prepared, shouldn't the examples (i.e. some_data=housing.iloc[:5]) use the processed data set as well (i.e. some_data=housing_prepared[:5])?

jsukup avatar Aug 23 '19 00:08 jsukup

Hi @jsukup , thanks for your question.

Are you referring to this code example?

>>> some_data = housing.iloc[:5]
>>> some_labels = housing_labels.iloc[:5]
>>> some_data_prepared = full_pipeline.transform(some_data)
>>> print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [ 210644.6045  317768.8069  210956.4333  59218.9888  189747.5584]
>>> print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

If so, then notice that it does prepare the data (full_pipeline.transform(some_data)) before it uses the trained model to make predictions (lin_reg.predict(some_data_prepared)).
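The fit-once, transform-later pattern can be sketched end to end as follows. This is only an illustration under stated assumptions: the data frame is a small synthetic stand-in for the housing data (made-up values, only two numeric columns), and num_pipeline is reduced to a single StandardScaler step rather than the book's full numeric pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing training set (values are made up).
rng = np.random.default_rng(42)
n = 50
categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]
housing = pd.DataFrame({
    "median_income": rng.uniform(1.0, 10.0, size=n),
    "housing_median_age": rng.uniform(1.0, 50.0, size=n),
    # Cycle through the categories so that all five appear in the training set.
    "ocean_proximity": np.tile(categories, n // 5),
})
housing_labels = 50_000 * housing["median_income"] + rng.normal(0, 10_000, size=n)

num_attribs = ["median_income", "housing_median_age"]
cat_attribs = ["ocean_proximity"]

# Simplified numeric pipeline (assumption: the book's version has more steps).
num_pipeline = Pipeline([("std_scaler", StandardScaler())])

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

# Fit the pipeline on the full training set, then train the model on its output.
housing_prepared = full_pipeline.fit_transform(housing)
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# New (raw) rows are only transformed, never fitted, before predicting.
some_data = housing.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
predictions = lin_reg.predict(some_data_prepared)

print(housing_prepared.shape, some_data_prepared.shape)  # same number of columns
```

Because the raw rows pass through the already-fitted pipeline, some_data_prepared has the same number of columns as housing_prepared, which is what lin_reg expects.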

Hope this helps, Aurélien

ageron avatar Aug 28 '19 07:08 ageron

@ageron Hi! When I test this on my own laptop, some_data_prepared (after full_pipeline.transform(some_data)) only contains three different categories, which doesn't match the linear model.

huang-jl avatar Feb 12 '20 09:02 huang-jl

Hi @huang-jl ,

I can see only two explanations:

  1. Perhaps your full_pipeline was trained on a part of the dataset that only contained three different categories. Instead, the model should be trained on the full training set (as in the book and the notebook), like in this cell:
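from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# num_pipeline and housing_num are defined in earlier notebook cells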

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
  2. Perhaps you are calling full_pipeline.fit_transform(some_data) instead of full_pipeline.transform(some_data)? If so, then just replace fit_transform() with transform(): we're only supposed to fit the pipeline on the training set.
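Both explanations come down to the encoder learning its categories from too few rows. A minimal sketch of the effect, assuming a toy categorical column rather than the real housing data, and using a bare OneHotEncoder instead of the full pipeline:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

# Toy training column: 20 rows cycling through all five categories.
train_cat = np.array(categories * 4).reshape(-1, 1)

# A 5-row sample that happens to contain only three distinct categories.
sample_cat = np.array(
    ["INLAND", "NEAR BAY", "INLAND", "<1H OCEAN", "NEAR BAY"]
).reshape(-1, 1)

encoder = OneHotEncoder()
encoder.fit(train_cat)

# Correct: transform() reuses the five categories learned from the training set.
ok = encoder.transform(sample_cat)
print(ok.shape)  # (5, 5)

# Mistake: fit_transform() refits the encoder on the sample alone, so it
# forgets the training categories and emits only three columns.
wrong = encoder.fit_transform(sample_cat)
print(wrong.shape)  # (5, 3)
print(len(encoder.categories_[0]))  # 3
```

Note that the refit also overwrites the encoder's state, so any later transform() call keeps producing three columns until the encoder is fitted on the full training set again.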

Hope this helps.

ageron avatar Mar 31 '20 01:03 ageron

I also ran into the same problem: some_data_prepared only has 3 categories instead of 5 when I first execute predict(some_data_prepared).

full_pipeline.named_transformers_['cat'].categories_ lists only 3 categories.

However, after I ran the cell mentioned above again, the issue was resolved without any code change and OneHotEncoder now learns that there are 5 categories and the predict works.
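A quick way to see which state the fitted objects are in before calling predict is to compare the feature counts directly. The sketch below is an assumption-laden illustration: feature_counts is a hypothetical helper (not part of sklearn or the notebook), the data is made up, and the numeric pipeline is replaced by a single StandardScaler.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data with all five categories present.
cats = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]
housing = pd.DataFrame({
    "median_income": np.linspace(1.0, 10.0, 20),
    "ocean_proximity": cats * 4,
})
housing_labels = 50_000 * housing["median_income"]

full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), ["median_income"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])
lin_reg = LinearRegression().fit(full_pipeline.fit_transform(housing), housing_labels)


def feature_counts(pipeline, model, sample):
    """Hypothetical helper: return (categories learned by the encoder,
    columns the pipeline produces, columns the model was trained on)."""
    n_categories = len(pipeline.named_transformers_["cat"].categories_[0])
    n_columns = pipeline.transform(sample).shape[1]
    n_expected = model.coef_.shape[-1]
    return n_categories, n_columns, n_expected


counts = feature_counts(full_pipeline, lin_reg, housing.iloc[:5])
print(counts)  # produced and expected column counts should match
```

If the first number is smaller than the number of categories in the training set, the pipeline was most likely last fitted on a subset, and rerunning the cell that fits it on the full training set restores the expected state.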

This is super weird though... maybe an internal bug in sklearn.

qingchuanzhu avatar Sep 12 '20 22:09 qingchuanzhu

I'm also having this same problem just before this code.

Aliiiu avatar Oct 25 '20 13:10 Aliiiu

Hi, on page 75 of the second edition of the book, I am having a problem loading the dataset after writing the code to download it.

Jeremiah004 avatar Jul 14 '22 18:07 Jeremiah004