
[QUESTION] [Chapter 2] Unreasonable household sizes (up to 1243 persons per household)

[Open] Roland-Pfeiffer opened this issue 2 years ago · 4 comments

Not sure if this is intentional, but some districts in the housing dataset have implausible household sizes, going up to 1000+ persons per household:

To Reproduce

#!/usr/bin/env python3

import pandas as pd
import matplotlib.pyplot as plt

# Auto-detect the terminal width when printing DataFrames.
pd.options.display.width = 0

url_csv_raw = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv"
housing_data = pd.read_csv(url_csv_raw)

# Average number of persons per household in each district.
housing_data["household_size"] = housing_data["population"] / housing_data["households"]
print(housing_data["household_size"].describe())
print(housing_data.loc[housing_data["household_size"] > 20, :])

# Scatter plot of population vs. households for districts with few households.
low_households = housing_data.loc[housing_data["households"] < 500, :]
plt.scatter(x=low_households["population"], y=low_households["households"], alpha=0.2)
plt.xlabel("Population")
plt.ylabel("Households")
plt.title("Households < 500")
plt.show()

# Histogram of the extreme household sizes.
housing_data.loc[housing_data["household_size"] > 10, "household_size"].hist(bins=20)
plt.xlabel("Household size (showing only >10)")
plt.ylabel("Frequency")
plt.title("Households of size > 10")
plt.show()

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity  household_size
3364     -120.51     40.41                36.0         36.0             8.0      4198.0         7.0         5.5179             67500.0          INLAND      599.714286
5986     -117.71     34.10                52.0        567.0           152.0      2688.0       126.0         1.8750            212500.0          INLAND       21.333333
8874     -118.45     34.06                52.0        204.0            34.0      1154.0        28.0         9.3370            500001.0       <1H OCEAN       41.214286
9172     -118.59     34.47                 5.0        538.0            98.0      8733.0       105.0         4.2391            154600.0          INLAND       83.171429
12104    -117.33     33.97                 8.0        152.0            19.0      1275.0        20.0         1.6250            162500.0          INLAND       63.750000
13034    -121.15     38.69                52.0        240.0            44.0      6675.0        29.0         6.1359            225000.0          INLAND      230.172414
13366    -117.63     33.94                36.0        447.0            95.0      2886.0        85.0         4.2578            183300.0          INLAND       33.952941
16420    -121.29     37.89                26.0        161.0            27.0      1542.0        30.0         5.7485            162500.0          INLAND       51.400000
16669    -120.70     35.32                46.0        118.0            17.0      6532.0        13.0         4.2639            350000.0      NEAR OCEAN      502.461538
19006    -121.98     38.32                45.0         19.0             5.0      7460.0         6.0        10.2264            137500.0          INLAND     1243.333333

[Figure 1: scatter plot of Population vs. Households for districts with fewer than 500 households]

[Figure 2: histogram of household sizes greater than 10]

Roland-Pfeiffer avatar Jul 17 '22 23:07 Roland-Pfeiffer

Thanks for your feedback. I'm currently traveling for the next few weeks, but I'll check this out when I get back.

ageron avatar Jul 18 '22 09:07 ageron

@Roland-Pfeiffer you might want to look into methods and techniques for removing outliers. The one that worked best for me on the same problem was the z-score.

vedanthv avatar Aug 03 '22 04:08 vedanthv
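For readers following along, here is a minimal sketch of the z-score filtering mentioned above. The DataFrame is a synthetic stand-in (30 typical districts plus one extreme district modeled on row 19006); with the real data you would load the CSV as in the reproduction script and apply the same filter to the `household_size` column.

```python
import pandas as pd

# Hypothetical stand-in for the housing data: 30 typical districts plus
# one extreme district like row 19006 (population 7460, households 6).
housing = pd.DataFrame({
    "population": [1200.0] * 30 + [7460.0],
    "households": [400.0] * 30 + [6.0],
})
housing["household_size"] = housing["population"] / housing["households"]

# z-score: how many standard deviations each value lies from the mean.
z = (housing["household_size"] - housing["household_size"].mean()) \
    / housing["household_size"].std()

# Keep only districts within 3 standard deviations of the mean.
filtered = housing[z.abs() <= 3.0]
print(f"kept {len(filtered)} of {len(housing)} districts")
```

One caveat: mean and standard deviation are themselves pulled around by extreme values, so with very heavy outliers a robust variant (e.g. median and MAD) can behave better than the plain z-score.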

@vedanthv : Hi, thanks for your response. I was mainly bringing this up because these outliers don't seem to be removed in the course of the example project, which I thought might degrade the quality of the training data. But I am not very experienced in ML, so I might be wrong. In any case, thank you for taking the time to respond!

Roland-Pfeiffer avatar Aug 31 '22 13:08 Roland-Pfeiffer

Hi @Roland-Pfeiffer ,

Thanks for your feedback. Indeed, these do look like pretty bad outliers. As @vedanthv mentioned, you can try filtering them out (or use any other technique discussed in the book to handle outliers). I'd love to know if it affects the results significantly. If it does, I'll definitely update the book. 👍

ageron avatar Sep 25 '22 22:09 ageron
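As one concrete example of such a filter (an IQR rule, a common generic technique rather than anything specific to the book), here is a sketch on hypothetical stand-in data; with the real dataset you would compute `household_size` as in the reproduction script and filter the same way.

```python
import pandas as pd

# Hypothetical stand-in data: 30 ordinary districts with household sizes
# between 2.25 and 3.25, plus one extreme district like row 19006.
housing = pd.DataFrame({
    "population": [900.0, 1000.0, 1100.0, 1200.0, 1300.0] * 6 + [7460.0],
    "households": [400.0] * 30 + [6.0],
})
housing["household_size"] = housing["population"] / housing["households"]

# IQR rule: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1 = housing["household_size"].quantile(0.25)
q3 = housing["household_size"].quantile(0.75)
iqr = q3 - q1
in_range = housing["household_size"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = housing[in_range]
print(f"kept {len(filtered)} of {len(housing)} districts")
```

Quartiles are less sensitive to extreme values than the mean, so this rule tends to be more stable than a plain z-score cutoff when the outliers are as severe as the ones reported here. To answer the question of whether it matters, one would train the model on `housing` and on `filtered` and compare the validation error.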