A minor error on page 258 and ch08.ipynb (Training a logistic regression model for document classification) when preparing train and test datasets

Open pavlo-yanchenko opened this issue 2 years ago • 1 comment

When we prepare the train and test datasets, we slice the IMDB dataset dataframe with the .loc indexer (label-based slicing on the index).

X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

It's worth noting that, contrary to usual Python slices, .loc includes both the start and the stop labels in the result (when they are present in the index). As a result, sample #25000 ends up in both the train and test datasets.
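
The overlap can be reproduced on a small scale. The sketch below is illustrative only: it uses a made-up 10-row dataframe with a default 0-based index in place of the IMDB data, and a hypothetical `split` value of 5 standing in for 25000.

```python
import pandas as pd

# Toy stand-in for the IMDB dataframe (assumption: default RangeIndex 0..9).
df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": [i % 2 for i in range(10)],
})

split = 5  # hypothetical split label, plays the role of 25000

# Label-based slicing with .loc includes BOTH endpoints.
train_idx = df.loc[:split].index   # labels 0..5
test_idx = df.loc[split:].index    # labels 5..9

# Labels that appear in both subsets.
overlap = train_idx.intersection(test_idx)

print(len(train_idx), len(test_idx), list(overlap))  # 6 5 [5]
```

The train slice has one row more than intended, and that extra row is also the first row of the test slice.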

Aug 14 '23 13:08 pavlo-yanchenko

Great point. I think it's best to switch to .iloc here.
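
For illustration, one way the .iloc variant could look. This is a sketch under assumptions, not necessarily the fix applied in the book or notebook: it reuses a made-up 10-row dataframe and a hypothetical `split` of 5 in place of 25000.

```python
import pandas as pd

# Toy stand-in for the IMDB dataframe (assumption, not the real data).
df = pd.DataFrame({
    "review": [f"review {i}" for i in range(10)],
    "sentiment": [i % 2 for i in range(10)],
})

split = 5  # hypothetical split position, plays the role of 25000

# Position-based slicing with .iloc excludes the stop position,
# like ordinary Python slices.
train = df.iloc[:split]
test = df.iloc[split:]

X_train = train["review"].values
y_train = train["sentiment"].values
X_test = test["review"].values
y_test = test["sentiment"].values

# No row should appear in both subsets.
shared = train.index.intersection(test.index)

print(len(X_train), len(X_test), len(shared))  # 5 5 0
```

Because .iloc works on positions rather than labels, the split stays disjoint even if the dataframe index has been shuffled or does not start at 0.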

Aug 14 '23 13:08 rasbt