Chap 2 (End to end), Prepare data: sample_incomplete_rows in the notebook is different from the book
The book is the 2nd edition (2019).
I don't understand this line:
sample_incomplete_rows = housing[housing.isnull().any(axis=1)]
Why does it differ from the book?
Hi @ladylazy9x, thanks for your question.
The book describes 3 different ways you can deal with null values:
# option 1 = just drop the rows that contain null values
housing.dropna(subset=["total_bedrooms"])
# option 2 = don't drop the rows, just drop the total_bedrooms column
housing.drop("total_bedrooms", axis=1)
# option 3 = replace the null values with the median value of the column
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)
To demo these techniques in the notebook, I could have copy/pasted this exact code, but the problem is that the housing DataFrame is quite long, and there are very few rows containing null values. So I made a small temporary copy of the housing DataFrame containing only the rows with null values, and I showed the 3 techniques on that small copy.
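Here is a minimal sketch of that idea. It assumes a tiny made-up DataFrame in place of the real housing data (the values are invented for illustration), and the names option1, option2 and option3 are hypothetical, used only to hold each result:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real housing data (values are made up).
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "population": [322.0, 2401.0, 496.0, 558.0],
})

# Small temporary DataFrame: only the rows with at least one null value.
sample_incomplete_rows = housing[housing.isnull().any(axis=1)]

# option 1: drop the rows with a null total_bedrooms (no rows remain here,
# since every row in the sample has a null total_bedrooms)
option1 = sample_incomplete_rows.dropna(subset=["total_bedrooms"])

# option 2: keep the rows, drop the total_bedrooms column
option2 = sample_incomplete_rows.drop("total_bedrooms", axis=1)

# option 3: fill the nulls with the median computed on the full DataFrame
median = housing["total_bedrooms"].median()
option3 = sample_incomplete_rows.fillna({"total_bedrooms": median})

print(option1)
print(option2)
print(option3)
```

The sketch assigns each result to a new variable instead of using inplace=True, so the three results can be compared side by side without modifying the sample.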
sample_incomplete_rows = housing[housing.isnull().any(axis=1)] creates a new DataFrame copied from housing but with only the rows that contain at least one null value. Let's break this down:
- housing.isnull() returns a DataFrame of the same shape as housing but with all the values replaced with True (if the original value was null) or False otherwise.
- housing.isnull().any(axis=1) takes the DataFrame full of booleans, and it checks each row (axis=1): if any value is True, it returns True, otherwise it returns False. So the result is a Series object that contains one boolean value per row: for every row in the housing DataFrame that contains at least one null value, it is True, otherwise it is False.
- Lastly, we use this Series as an index to get only the rows we're interested in from the housing DataFrame.
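The same breakdown can be followed step by step on a toy example. This is only a sketch: the DataFrame values are made up, and the intermediate names null_mask and row_has_null are hypothetical, introduced here just to show each stage separately:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the housing DataFrame.
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Step 1: DataFrame of booleans with the same shape as housing.
null_mask = housing.isnull()

# Step 2: Series with one boolean per row, True if any value in the row is null.
row_has_null = null_mask.any(axis=1)

# Step 3: use the boolean Series to select only the rows that have a null.
sample_incomplete_rows = housing[row_has_null]

print(null_mask)
print(row_has_null)
print(sample_incomplete_rows)
```

Only the middle row has a missing total_bedrooms value, so it is the only row that ends up in sample_incomplete_rows.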
Is this clearer?