
Chapter 2 Analyze the Best Models and Their Errors

Open Ilya-Curie opened this issue 3 years ago • 3 comments

To analyze the relative importance of each attribute for making accurate predictions, the book uses the following code (with its output):

>>> feature_importances = grid_search.best_estimator_.feature_importances_
>>> feature_importances
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])

And to add their corresponding attribute names:

>>> extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
>>> cat_encoder = full_pipeline.named_transformers_["cat"]
>>> cat_one_hot_attribs = list(cat_encoder.categories_[0])

>>> attributes = num_attribs + extra_attribs + cat_one_hot_attribs
>>> sorted(zip(feature_importances, attributes), reverse=True)
[(0.3661589806181342, 'median_income'),
(0.1647809935615905, 'INLAND'),
(0.10879295677551573, 'pop_per_hhold'),
(0.07334423551601242, 'longitude'),
(0.0629090704826203, 'latitude'),
(0.05641917918195401, 'rooms_per_hhold'),
(0.05335107734767581, 'bedrooms_per_room'),
(0.041143798478729635, 'housing_median_age'),
(0.014874280890402767, 'population'),
(0.014672685420543237, 'total_rooms'),
(0.014257599323407807, 'households'),
(0.014106483453584102, 'total_bedrooms'),
(0.010311488326303787, '<1H OCEAN'),
(0.002856474637320158, 'NEAR OCEAN'),
(0.00196041559947807, 'NEAR BAY'),
(6.028038672736599e-05, 'ISLAND')]

My question is: why do I have to add extra_attribs? Or, how do I know that I must add these attributes?

Here is the output without adding extra_attribs:

>>> feature_importances=grid_search.best_estimator_.feature_importances_

>>> #extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
>>> cat_encoder = full_pipeline.named_transformers_["cat"]
>>> cat_one_hot_attribs = list(cat_encoder.categories_[0])

>>> attributes = num_attribs + cat_one_hot_attribs #+ extra_attribs 
>>> sorted(zip(feature_importances, attributes), reverse=True)

[(0.303268232301214, 'median_income'),
 (0.1730639450304893, 'NEAR OCEAN'),
 (0.10895862174634888, 'INLAND'),
 (0.0844196144263057, 'ISLAND'),
 (0.07557206707255014, 'longitude'),
 (0.06398786252477989, 'latitude'),
 (0.06315655490931624, '<1H OCEAN'),
 (0.04240720593117474, 'housing_median_age'),
 (0.01829282732311651, 'total_rooms'),
 (0.017560189966804522, 'population'),
 (0.01689244166020893, 'total_bedrooms'),
 (0.01668817806453196, 'households'),
 (0.008535150622100876, 'NEAR BAY')]

How do I know that this is wrong? Without extra_attribs, I cannot conclude, as the book does, that apparently only one ocean_proximity category is really useful and that you could try dropping the others.

Thanks for your time.

Ilya-Curie, Sep 08 '20 02:09

feature_importances is an array giving the importance of each feature, in the same order as the columns of your input data. In the book's example, the full_pipeline was used, and the resulting DataFrame (I think it was called housing_num_tr) has a certain column order. This still holds after converting it to NumPy using DataFrameSelector().

Python's zip simply pairs the n-th entry of feature_importances with the n-th entry of attributes. Therefore, you have to:

  • know the order of both lists
  • make sure that they fit
  • keep in mind that if one list is shorter than the other, zip will just stop after reaching the last entry of the shorter list

In your example, the entries of num_attribs are correctly paired with their feature importances. After that, the importances of the three extra_attribs were paired with the first three entries of cat_one_hot_attribs. Because attributes was then three names short, zip silently dropped the last three importances.
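To make the misalignment concrete, here is a minimal sketch with toy data. The importance values and the shortened attribute lists are made up for illustration; they are not the numbers from the book.

```python
# Toy importances (made-up values): 2 numeric columns, 1 derived column,
# then 2 one-hot columns, in the order the model saw them during training.
importances = [0.5, 0.2, 0.1, 0.15, 0.05]

num_attribs = ["longitude", "median_income"]
extra_attribs = ["pop_per_hhold"]
cat_one_hot_attribs = ["INLAND", "ISLAND"]

# Names in the same order as the training columns: every pair is correct.
correct = list(zip(importances, num_attribs + extra_attribs + cat_one_hot_attribs))

# Leaving out extra_attribs shifts the categorical names up by one slot,
# and zip stops early because the name list is now one entry short.
wrong = list(zip(importances, num_attribs + cat_one_hot_attribs))

print(correct[2])  # (0.1, 'pop_per_hhold')
print(wrong[2])    # (0.1, 'INLAND')
print(len(correct), len(wrong))  # 5 4
```

Note that the wrong version raises no error: it just attaches the derived column's importance to a category name and drops the last value.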

TobiLang, Sep 17 '20 10:09

Hey, I did not understand this step either :(

ahmad-alismail, Feb 20 '22 04:02

Thanks for your question @Ilya-Curie, and thanks for the excellent answer @TobiLang, that's exactly right.

The feature_importances array contains the importance of the model's input features, in the same order as they were used to train the model. So if we want to know the names of the most important features, we must find the name of every input feature that was used to train the model, in the right order. That's what attributes is. To train the model, we used housing_prepared, which was created by the full_pipeline, like this:

class CombinedAttributesAdder(...):
    [...]

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)

The full_pipeline starts by applying the num_pipeline to the numerical attributes, that is, all the original attributes except "ocean_proximity". Specifically, these are the 8 attributes "longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income". The numerical pipeline starts with an imputer, which fills in missing values but does not change the list of feature names. Then comes the CombinedAttributesAdder, which adds three extra attributes: "rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room". This is where these extra attributes come from. Next comes the standard scaler, which scales the values but does not change the list of feature names.

Back to the full_pipeline: it also applies a OneHotEncoder to the categorical attributes: in this case, there's just one categorical attribute, the "ocean_proximity". After one-hot encoding, this attribute gets replaced with one attribute per category. We need to list them in the correct order, which is the same order as in the categories_ attribute of the OneHotEncoder.
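As a rough illustration of where the category order comes from: as far as I know, OneHotEncoder with its default settings stores the distinct values of each categorical column in sorted order in categories_. The sketch below mimics that with plain Python on a made-up sample of values. It is only an assumption-based illustration; in real code, read the order from categories_ rather than recomputing it.

```python
# Hypothetical sample of raw "ocean_proximity" values, in arbitrary order.
ocean_proximity = [
    "NEAR BAY", "INLAND", "<1H OCEAN", "NEAR OCEAN",
    "ISLAND", "INLAND", "<1H OCEAN",
]

# Assumption: the encoder's default behaviour matches sorting the distinct
# values, which gives one output column per category in this order.
cat_one_hot_attribs = sorted(set(ocean_proximity))

print(cat_one_hot_attribs)
# ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
```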

So the final list of attributes is:

  • "longitude", "latitude", "housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income" (the original numerical attributes)
  • plus: "rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room" (added by the CombinedAttributesAdder)
  • plus: "<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"

Once we have a list of feature names in the right order, we can use zip(feature_importances, attributes) to match each importance value in the first list with the right feature name in the second list.
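Since zip does not complain when the two lists have different lengths, a small guard can catch a missing group of names early. The sketch below uses the importance values and attribute names quoted earlier in this thread. The helper name rank_features is hypothetical, not something from the book or from Scikit-Learn.

```python
# Importance values copied from the first output in this thread.
feature_importances = [
    7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
    1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
    5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
    1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03,
]
num_attribs = ["longitude", "latitude", "housing_median_age", "total_rooms",
               "total_bedrooms", "population", "households", "median_income"]
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]


def rank_features(importances, attributes):
    """Hypothetical helper: pair importances with names, most important first."""
    # Refuse to pair lists of different lengths instead of truncating silently.
    if len(importances) != len(attributes):
        raise ValueError(
            f"{len(importances)} importances but {len(attributes)} attribute names"
        )
    return sorted(zip(importances, attributes), reverse=True)


ranked = rank_features(
    feature_importances, num_attribs + extra_attribs + cat_one_hot_attribs
)
print([name for _, name in ranked[:3]])
# ['median_income', 'INLAND', 'pop_per_hhold']
```

Calling rank_features(feature_importances, num_attribs + cat_one_hot_attribs) raises a ValueError (16 importances but 13 names) instead of returning a shifted ranking.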

Hope this helps.

ageron, Feb 22 '22 04:02