Chap 2 End to end, Prepare data: sample_incomplete_rows differs from the book

Open hypntzed78 opened this issue 4 years ago • 1 comments

The book is the 2nd edition (2019). I don't understand this line: sample_incomplete_rows = housing[housing.isnull().any(axis=1)]. Why does it differ from the book?

hypntzed78 avatar Sep 16 '21 10:09 hypntzed78

Hi @ladylazy9x , Thanks for your question.

The book describes 3 different ways you can deal with null values:

# option 1 = just drop the rows that contain null values
housing.dropna(subset=["total_bedrooms"])

# option 2 = don't drop the rows, just drop the total_bedrooms column
housing.drop("total_bedrooms", axis=1)

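# option 3 = replace the null values with the median value of the column
# (written as an assignment, since an in-place fillna on a selected column may not update housing in recent pandas versions)
median = housing["total_bedrooms"].median()
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)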

To demo these techniques in the notebook, I could have copy/pasted this exact code, but the problem is that the housing DataFrame is quite long, and there are very few rows containing null values. So I made a small temporary copy of the housing DataFrame containing only the rows with null values, and I showed the 3 techniques on that small copy.
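For illustration, here is a minimal sketch of the three options applied to such a small copy. It assumes pandas and NumPy are installed, and the `toy` DataFrame with its values is a made-up stand-in for `housing`, not data from the book or the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing DataFrame (values are invented).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "population": [322.0, 2401.0, 496.0, 558.0],
})

# Small copy holding only the rows that contain at least one null value.
sample_incomplete_rows = toy[toy.isnull().any(axis=1)].copy()

# Option 1: drop the rows where total_bedrooms is null.
option1 = sample_incomplete_rows.dropna(subset=["total_bedrooms"])

# Option 2: keep the rows, drop the total_bedrooms column.
option2 = sample_incomplete_rows.drop("total_bedrooms", axis=1)

# Option 3: fill the nulls with the median computed on the full DataFrame.
median = toy["total_bedrooms"].median()
option3 = sample_incomplete_rows.copy()
option3["total_bedrooms"] = option3["total_bedrooms"].fillna(median)
```

On this toy data every row in the small copy is missing total_bedrooms, so option 1 leaves nothing behind, option 2 keeps both rows minus one column, and option 3 keeps everything with the gaps filled.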

sample_incomplete_rows = housing[housing.isnull().any(axis=1)] creates a new DataFrame copied from housing but with only the rows that contain at least one null value. Let's break this down:

  • housing.isnull() returns a DataFrame of the same shape as housing but with all the values replaced with True (if the original value was null) or False otherwise.
  • housing.isnull().any(axis=1) takes the DataFrame full of booleans, and it checks each row (axis=1): if any value is True, it returns True, otherwise it returns False. So the result is a Series object that contains one boolean value per row: for every row in the housing DataFrame that contains at least one null value, it is True, otherwise it is False.
  • Lastly, we use this boolean Series as a mask to index the housing DataFrame: housing[mask] keeps only the rows where the mask is True, which are the rows we're interested in.
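The breakdown above can be followed step by step on a tiny example. This is only a sketch: it assumes pandas and NumPy are installed, and `toy` is a hypothetical DataFrame with invented values standing in for `housing`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing DataFrame (values are invented).
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "population": [322.0, 2401.0, 496.0, 558.0],
})

# Step 1: same shape as toy, True wherever the original value is null.
null_mask = toy.isnull()

# Step 2: one boolean per row, True if any value in that row is null.
row_has_null = null_mask.any(axis=1)

# Step 3: boolean indexing keeps only the rows where the Series is True.
incomplete = toy[row_has_null]
```

Printing `null_mask`, `row_has_null`, and `incomplete` in turn shows each intermediate result; here rows 1 and 3 are the ones that survive.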

Is this clearer?

ageron avatar Sep 23 '21 11:09 ageron