introduction-datascience-python-book
Minor correction to chapter 3: logic error, outliers treatment
Section 3.3.3, on outlier treatment, suggests that we can clean up values that deviate from the median by 2 or 3 standard deviations:
df2 = df.drop( df.index[(df.income == '>50K\n') &
(df['age'] > df['age'].median() + 35) &
(df['age'] > df['age'].median() - 15)
])
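This boolean index is wrong. The two age conditions are joined with &, and any age greater than median + 35 is also greater than median - 15, so the second condition is redundant: only rows more than 35 above the median are dropped, and values far below the median are never removed. A possible correction is to change the second > to < and to join the two age conditions with | instead of &: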
df2 = df.drop( df.index[(df.income == '>50K\n') &
((df['age'] > df['age'].median() + 35) |
(df['age'] < df['age'].median() - 15))
])
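To check the difference between the two masks, here is a minimal sketch on a small made-up DataFrame. The ages and income labels below are hypothetical, not taken from the book's dataset; only the column names, the '>50K\n' label, and the +35 / -15 offsets come from the snippets above.

```python
import pandas as pd

# Hypothetical toy data (not the book's dataset), with the same column names.
df = pd.DataFrame({
    "age": [20, 36, 37, 38, 40, 75, 90, 18],
    "income": [">50K\n", ">50K\n", "<=50K\n", ">50K\n",
               ">50K\n", ">50K\n", "<=50K\n", "<=50K\n"],
})

median_age = df["age"].median()
high_income = df["income"] == ">50K\n"

# Mask as printed in the book: the second age condition is implied by the first.
original_mask = (high_income
                 & (df["age"] > median_age + 35)
                 & (df["age"] > median_age - 15))

# Proposed correction: an outlier is too far above OR too far below the median.
corrected_mask = high_income & ((df["age"] > median_age + 35)
                                | (df["age"] < median_age - 15))

# Drop the rows selected by the corrected mask.
df2 = df.drop(df[corrected_mask].index)

print("dropped by original mask: ", list(df[original_mask].index))
print("dropped by corrected mask:", list(df[corrected_mask].index))
```

With this toy data the median age is 37.5, so the original mask only selects the 75-year-old high-income row, while the corrected mask also selects the 20-year-old one. Rows with income '<=50K\n' are left untouched in both cases, as in the original snippet.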