fastbook
09_tabular: ProductSize histogram's y-axis is mislabeled
Problem
The ProductSize histogram in the book's "Partial Dependence" section has mislabeled y-axis ticks, so it reports the wrong counts for some ProductSize values. Two examples:
| ProductSize | Correct Count | Book's Incorrect Count |
|---|---|---|
| Large | 280 | ~500 |
| Mini | 627 | ~100 |
See below for details.
Book's incorrect histogram
The "Partial Dependence" section has a ProductSize histogram that is produced by this code:
p = valid_xs_final['ProductSize'].value_counts(sort=False).plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)), c);
and renders like this:
Corrected histogram
We can reveal the mistake in the book's histogram by inspecting a textual histogram from the dataframe:
cond = (df.saleYear<2011) | (df.saleMonth<10)
df_valid = df[~cond]
df_valid.ProductSize.value_counts(dropna=False)
That code produces this textual histogram:
NaN 3930
Medium 1331
Large / Medium 1223
Mini 627
Small 484
Large 280
Compact 113
Name: ProductSize, dtype: int64
See the table at the top of this issue for a comparison between the counts of these ProductSizes and the ones from the book's histogram.
Cause
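The code that labels the y-axis assumes that the bottom bar is ProductSize code 0, the next bar is code 1, and so on, but this isn't the case. Passing sort=False to value_counts only skips sorting by count; it does not return the index in ascending code order, so the bars are not ordered by ProductSize code and the positional labels land on the wrong bars.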
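The mismatch can be reproduced without the bulldozer data. The sketch below is only an illustration: the class list and codes are made up, and collections.Counter (which keeps keys in order of first appearance) stands in for a value_counts(sort=False) result whose index is not in ascending code order.

```python
from collections import Counter

# Hypothetical class list and category codes; index 0 is the missing-value class.
classes = ['#na#', 'Compact', 'Large', 'Mini']
codes = [3, 3, 0, 2, 3, 0, 1, 0, 0]

# Counter keeps keys in order of first appearance, standing in for counts
# that come back in an order other than ascending code.
counts = Counter(codes)
bar_codes = list(counts)             # code drawn at each bar position
bar_heights = list(counts.values())  # height of each bar

# Positional labeling (the book's approach): bar i gets classes[i],
# regardless of which code that bar actually holds.
positional = dict(zip(classes, bar_heights))

# Index-based labeling: each bar gets the class of its own code.
by_code = {classes[code]: n for code, n in counts.items()}

print(positional)
print(by_code)
```

Here the bars hold codes 3, 0, 2, 1, so positional labeling attributes Mini's count to '#na#', while index-based labeling pairs every class with its own count.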
Example fix
Here's some code that labels the y-axis correctly by looking up each label from the index of the counts, so the labels follow the order of the bars:
counts = valid_xs_final['ProductSize'].value_counts(sort=False)
p = counts.plot.barh()
c = [to.classes['ProductSize'][i] for i in counts.index.values]
plt.yticks(range(len(c)), c)
Looks like a fix was submitted in pull request #410.
I can confirm this issue; I ran into it while writing my own notes. My fix was as follows:
p = valid_xs_final['ProductSize'].value_counts(sort=False).sort_index().plot.barh()
c = to.classes['ProductSize']
plt.yticks(range(len(c)), c);
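One caveat with the sort_index() approach: it lines up with range(len(c)) only when every class code occurs in the validation set, which appears to be the case for the counts listed above. If a class were absent, the bars would shift relative to the labels. A minimal sketch of one way to guard against that, using made-up classes and codes (not the real to.classes or valid_xs_final), is to reindex the counts over every code:

```python
import pandas as pd

# Hypothetical classes and codes; no row has code 2 ('Large').
classes = ['#na#', 'Compact', 'Large', 'Mini']
codes = pd.Series([3, 3, 0, 3, 0, 1, 0, 0])

# sort_index() alone leaves the index as [0, 1, 3]: three bars for four labels,
# so the third bar (Mini) would sit at the 'Large' tick.
sorted_counts = codes.value_counts(sort=False).sort_index()

# Reindexing over every code gives one bar per class, with 0 for absent classes.
counts = sorted_counts.reindex(range(len(classes)), fill_value=0)
labels = [classes[i] for i in counts.index]

print(dict(zip(labels, counts.tolist())))
```

After the reindex, counts.plot.barh() followed by plt.yticks(range(len(labels)), labels) would place each label on its own bar, including an empty bar for the absent class.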