pycon-pandas-tutorial
NaN values in cast.title prevent c.loc[] from being fast after .sort_index()
Hi Brandon, this is a wonderful tutorial, thanks so much for making it.
The issue I encountered is that running c = cast.set_index(['title']).sort_index() didn't speed up the subsequent search with c.loc['Sleuth']. (This relates to the one-minute segment of the tutorial on YouTube starting at 1:08:54: https://youtu.be/5JnMutdy6Fw?t=4134)
I think the problem is that my version of the cast dataframe has six movies with NaN as the title. When these NaN values get into the index, sorting the index doesn't speed up c.loc['Sleuth']. At least I think this is true, based on testing I did with randomly generated dataframes with and without NaN in the index.
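For anyone who wants to reproduce that kind of test, here is a minimal sketch. It is not the exact code I ran: the helper names make_frame and time_lookup, the "Movie 00042" titles, and the frame sizes are all made up for illustration. It also prints index.is_monotonic_increasing, on the assumption that pandas only takes its fast lookup path when the index reports itself as sorted; the result may differ between pandas versions.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows=200_000, n_titles=5_000, n_missing=0, seed=0):
    """Build a random frame with a 'title' column (hypothetical test data)."""
    rng = np.random.default_rng(seed)
    titles = np.array([f"Movie {i:05d}" for i in range(n_titles)], dtype=object)
    col = titles[rng.integers(0, n_titles, size=n_rows)]
    col[:n_missing] = np.nan       # inject NaN titles into the first rows
    col[-1] = "Movie 00042"        # make sure the lookup key is present
    return pd.DataFrame({"title": col, "n": rng.integers(1, 100, size=n_rows)})


def time_lookup(frame, key, repeats=20):
    """Index by title, sort, and return (is_sorted_flag, mean seconds per .loc)."""
    indexed = frame.set_index(["title"]).sort_index()
    indexed.loc[key]               # warm-up call so one-off setup cost is excluded
    start = time.perf_counter()
    for _ in range(repeats):
        indexed.loc[key]
    elapsed = (time.perf_counter() - start) / repeats
    return indexed.index.is_monotonic_increasing, elapsed


if __name__ == "__main__":
    for n_missing in (0, 6):
        flag, seconds = time_lookup(make_frame(n_missing=n_missing), "Movie 00042")
        print(f"NaN titles: {n_missing}  monotonic: {flag}  mean lookup: {seconds:.6f}s")
```

If the two runs show a clear gap in lookup time, and the monotonic flag is False only for the frame with NaN titles, that would be consistent with what I saw on the real cast data.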
I fixed it by making a copy of the cast dataframe without those six movies (the movies with NaN titles), followed by setting and sorting the title index, like this:
c = cast[cast.title.notnull()]
c = c.set_index(['title']).sort_index()
Running c.loc['Sleuth'] on this new NaN-free dataframe is very fast, as expected.
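Here is a small self-contained version of the workaround for anyone who wants to try it without the full dataset. The toy cast frame below is made up (the names and years are placeholders, not rows from the real data). It also shows dropna(subset=['title']) as what I believe is an equivalent way to drop the same rows.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the tutorial's cast dataframe (hypothetical rows).
cast = pd.DataFrame({
    "title": ["Sleuth", np.nan, "Hamlet", "Sleuth", np.nan, "Alien"],
    "year": [1972, 1950, 1996, 2007, 1960, 1979],
    "name": ["Actor A", "Actor B", "Actor C", "Actor D", "Actor E", "Actor F"],
})

# The workaround from this issue: drop NaN titles, then index and sort.
c = cast[cast.title.notnull()]
c = c.set_index(["title"]).sort_index()

# Alternative spelling that should remove the same rows.
alt = cast.dropna(subset=["title"]).set_index(["title"]).sort_index()

print(c.index.is_monotonic_increasing)  # reports whether pandas sees the index as sorted
print(c.loc["Sleuth"])
```

On the real data, c.index.is_monotonic_increasing is a quick way to check whether the sort took effect before timing c.loc['Sleuth'].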
It's possible that I made a mistake when downloading the original data and running the build to make cast. Either way, I thought I should mention this in case someone else has the same issue.