handson-ml2
[QUESTION] [Chapter 2] Unreasonable household sizes (up to 1243 persons per household)
Not sure if this is intentional, but some districts in the housing dataset have very unreasonable household sizes, going up to 1000+ persons per household:
To Reproduce
#!/usr/bin/env python3
import pandas as pd
import matplotlib.pyplot as plt

# Let pandas pick the display width automatically so rows are not wrapped.
pd.options.display.width = 0

url_csv_raw = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv"
housing_data = pd.read_csv(url_csv_raw)

# Average number of persons per household in each district.
housing_data["household_size"] = housing_data["population"] / housing_data["households"]
print(housing_data["household_size"].describe())
print(housing_data.loc[housing_data["household_size"] > 20, :])

# Scatter plot of population vs. households for districts with few households.
low_households = housing_data.loc[housing_data["households"] < 500, :]
plt.scatter(x=low_households["population"], y=low_households["households"], alpha=0.2)
plt.xlabel("Population")
plt.ylabel("Households")
plt.title("Households < 500")
plt.show()

# Histogram of the suspiciously large household sizes.
housing_data.loc[housing_data["household_size"] > 10, "household_size"].hist(bins=20)
plt.xlabel("Household size (showing only >10)")
plt.ylabel("Frequency")
plt.title("Households of size > 10")
plt.show()
Output (districts with household_size > 20; first value is the row index):

longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value ocean_proximity household_size
3364 -120.51 40.41 36.0 36.0 8.0 4198.0 7.0 5.5179 67500.0 INLAND 599.714286
5986 -117.71 34.10 52.0 567.0 152.0 2688.0 126.0 1.8750 212500.0 INLAND 21.333333
8874 -118.45 34.06 52.0 204.0 34.0 1154.0 28.0 9.3370 500001.0 <1H OCEAN 41.214286
9172 -118.59 34.47 5.0 538.0 98.0 8733.0 105.0 4.2391 154600.0 INLAND 83.171429
12104 -117.33 33.97 8.0 152.0 19.0 1275.0 20.0 1.6250 162500.0 INLAND 63.750000
13034 -121.15 38.69 52.0 240.0 44.0 6675.0 29.0 6.1359 225000.0 INLAND 230.172414
13366 -117.63 33.94 36.0 447.0 95.0 2886.0 85.0 4.2578 183300.0 INLAND 33.952941
16420 -121.29 37.89 26.0 161.0 27.0 1542.0 30.0 5.7485 162500.0 INLAND 51.400000
16669 -120.70 35.32 46.0 118.0 17.0 6532.0 13.0 4.2639 350000.0 NEAR OCEAN 502.461538
19006 -121.98 38.32 45.0 19.0 5.0 7460.0 6.0 10.2264 137500.0 INLAND 1243.333333
Thanks for your feedback. I'm currently traveling for the next few weeks, but I'll check this out when I get back.
@Roland-Pfeiffer you might want to look into techniques for removing outliers. The one that worked best for me on this problem was the z-score.
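For reference, a minimal sketch of the z-score approach mentioned above. The data here is synthetic stand-in data (20 ordinary districts plus one modeled on district 19006 with 7460 people across 6 households), not the actual housing.csv:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-in for the housing data: 20 ordinary districts, plus one
# extreme district like #19006 (population 7460, only 6 households).
population = np.append(rng.uniform(800, 2000, 20), 7460.0)
households = np.append(rng.uniform(300, 700, 20), 6.0)
df = pd.DataFrame({"population": population, "households": households})
df["household_size"] = df["population"] / df["households"]

# Z-score: how many standard deviations each value lies from the mean.
z = (df["household_size"] - df["household_size"].mean()) / df["household_size"].std()

# Keep only the rows within 3 standard deviations.
filtered = df.loc[z.abs() < 3]
print(len(df), len(filtered))  # the extreme district is dropped
```

Note that with very few rows a single outlier cannot exceed a z-score of 3 (the maximum possible z-score is bounded by the sample size), so this works better the larger the dataset.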
@vedanthv: Hi, thanks for your response. I mainly brought this up because these outliers don't seem to be removed anywhere in the example project, which I thought might degrade the quality of the training data. But I am not very experienced in ML, so I might be wrong. In any case, thank you for taking the time to respond!
Hi @Roland-Pfeiffer ,
Thanks for your feedback. Indeed, these do look like pretty bad outliers. As @vedanthv mentioned, you can try filtering them out (or use any other technique discussed in the book to handle outliers). I'd love to know if it affects the results significantly. If it does, I'll definitely update the book. 👍
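As an alternative to the z-score, here is a minimal sketch of an IQR-based (Tukey fence) filter, again on toy values rather than the actual dataset; the last value mimics district 19006 (7460 / 6 ≈ 1243):

```python
import pandas as pd

# Toy household-size values; the last one is the extreme outlier.
sizes = pd.Series([2.4, 2.6, 2.8, 3.0, 3.2, 1243.3])

# Tukey's fences: keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = sizes.quantile([0.25, 0.75])
iqr = q3 - q1
kept = sizes[sizes.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(kept.tolist())  # the 1243.3 value is filtered out
```

Unlike the z-score, the IQR rule is based on quantiles, so it is not distorted by the outlier itself even on small samples.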