machine-learning-book
A minor error on page 258 and ch08.ipynb (Training a logistic regression model for document classification) when preparing train and test datasets
When we prepare the train and test datasets, we slice the IMDB dataset dataframe with the .loc method (slicing using the index).
X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values
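It's worth noting that, unlike ordinary Python slices, .loc is label-based and includes both the start and the stop labels in the result (when they are present in the index). Assuming the dataframe has the default integer index, sample #25000 therefore ends up in both the train and test datasets, and the train set contains 25001 rows instead of 25000.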
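To illustrate the overlap, here is a minimal sketch on a made-up 10-row dataframe (not the IMDB data), assuming the default integer index and a split point of 5 instead of 25000:

```python
import pandas as pd

# Toy stand-in for the IMDB dataframe (hypothetical data, default RangeIndex).
df = pd.DataFrame({
    "review": [f"text {i}" for i in range(10)],
    "sentiment": [i % 2 for i in range(10)],
})

# Label-based slicing: both endpoints are included.
train_idx = df.loc[:5].index
test_idx = df.loc[5:].index

# Row 5 appears on both sides of the split.
overlap = train_idx.intersection(test_idx).tolist()
print(len(train_idx), len(test_idx), overlap)
```

The train slice has one row more than intended, and the boundary row is shared by both sets.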
Great point. I think it's best to switch to .iloc here, since its position-based slices exclude the stop point.
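One way the .iloc version could look, sketched on a made-up 10-row dataframe with a split point of 5 (the variable `split` and the toy data are assumptions for illustration; the actual fix in the book or notebook may be written differently):

```python
import pandas as pd

# Toy stand-in for the IMDB dataframe (hypothetical data).
df = pd.DataFrame({
    "review": [f"text {i}" for i in range(10)],
    "sentiment": [i % 2 for i in range(10)],
})

split = 5  # would be 25000 for the IMDB data

# Position-based slicing: the stop position is excluded, as in plain Python.
X_train = df["review"].iloc[:split].values
y_train = df["sentiment"].iloc[:split].values
X_test = df["review"].iloc[split:].values
y_test = df["sentiment"].iloc[split:].values

# No review is shared between the two sets.
shared = set(X_train) & set(X_test)
print(len(X_train), len(X_test), shared)
```

Selecting the column first and then applying .iloc keeps the column referenced by name while the rows are split by position.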