Chap 2 End to end, Prepare data: sample_incomplete_rows differs from the book

Open hypntzed78 opened this issue 4 years ago • 1 comments

I'm using the 2nd edition (2019) of the book. I don't understand this line in the notebook: sample_incomplete_rows = housing[housing.isnull().any(axis=1)]. Why does it differ from the book?

Sep 16 '21 10:09 hypntzed78

Hi @ladylazy9x , Thanks for your question.

The book describes 3 different ways you can deal with null values:

# option 1 = just drop the rows that contain null values
housing.dropna(subset=["total_bedrooms"])

# option 2 = don't drop the rows, just drop the total_bedrooms column
housing.drop("total_bedrooms", axis=1)

# option 3 = replace the null values with the median value of the column
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)
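For anyone who wants to try the three options without loading the full dataset, here is a minimal sketch on a tiny made-up DataFrame. The name `toy` and its values are hypothetical, not taken from the housing data. Options 1 and 2 return new DataFrames and leave the original untouched. For option 3 the sketch assigns the filled column back rather than calling `inplace=True` on a selected column, since recent pandas versions may warn about that pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the housing DataFrame
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
})

# Option 1: drop the rows where total_bedrooms is null (returns a new DataFrame)
option1 = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole total_bedrooms column (returns a new DataFrame)
option2 = toy.drop("total_bedrooms", axis=1)

# Option 3: replace nulls with the column median, assigning the result back
option3 = toy.copy()
median = option3["total_bedrooms"].median()  # NaN values are skipped
option3["total_bedrooms"] = option3["total_bedrooms"].fillna(median)

print(option1.shape)                       # rows with nulls removed
print(option2.columns.tolist())            # total_bedrooms column removed
print(option3["total_bedrooms"].tolist())  # nulls replaced by the median
```

Each option works on its own result here, so the three outcomes can be compared side by side.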

To demo these techniques in the notebook, I could have copy/pasted this exact code, but the problem is that the housing DataFrame is quite long, and there are very few rows containing null values. So I made a small temporary copy of the housing DataFrame containing only the rows with null values, and I showed the 3 techniques on that small copy.

sample_incomplete_rows = housing[housing.isnull().any(axis=1)] creates a new DataFrame copied from housing but with only the rows that contain at least one null value. Let's break this down:

housing.isnull() returns a DataFrame of the same shape as housing but with all the values replaced with True (if the original value was null) or False otherwise.
housing.isnull().any(axis=1) takes the DataFrame full of booleans, and it checks each row (axis=1): if any value is True, it returns True, otherwise it returns False. So the result is a Series object that contains one boolean value per row: for every row in the housing DataFrame that contains at least one null value, it is True, otherwise it is False.
Lastly, we use this boolean Series as a mask to index the housing DataFrame, which keeps only the rows where the mask is True, i.e. the rows we're interested in.
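The breakdown above can be reproduced step by step on a small example. This is only a sketch: `housing_toy` and its values are made up for illustration, and the intermediate variable names are not from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; rows 1 and 2 each contain one null value
housing_toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
    "population": [322.0, 2401.0, np.nan],
})

# Step 1: DataFrame of booleans with the same shape as housing_toy
null_mask = housing_toy.isnull()

# Step 2: one boolean per row, True if any value in that row is null
row_has_null = null_mask.any(axis=1)

# Step 3: boolean indexing keeps only the rows where the mask is True
incomplete = housing_toy[row_has_null]

print(null_mask.shape)
print(row_has_null.tolist())
print(incomplete.index.tolist())
```

Chaining the three steps into one expression gives the same result as the notebook's one-liner, `housing_toy[housing_toy.isnull().any(axis=1)]`.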

Is this clearer?

Sep 23 '21 11:09 ageron