CH02: Use train_test_split instead of StratifiedShuffleSplit
In CH02, the book uses StratifiedShuffleSplit to split the data according to income category. It might be more user-friendly to use train_test_split instead. Code from master:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
updated version:
from sklearn.model_selection import train_test_split
strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"])
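To illustrate the comparison, here is a minimal sketch that runs both approaches on a synthetic stand-in for the housing data and checks that each test set keeps roughly the same income_cat proportions as the full dataset. The DataFrame contents, the category probabilities, and the variable names (sss_test, tts_test, and so on) are assumptions made for illustration, not the book's data or code.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the housing data (hypothetical values, not the real dataset).
rng = np.random.RandomState(42)
housing = pd.DataFrame({
    "median_income": rng.uniform(0.5, 15.0, size=1000),
    # Integer category labels drawn with uneven, made-up probabilities.
    "income_cat": rng.choice([1, 2, 3, 4, 5], size=1000,
                             p=[0.05, 0.30, 0.35, 0.20, 0.10]),
})

# Approach 1: StratifiedShuffleSplit, as in the original snippet.
# split() yields positional indices, so .iloc is used here.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(housing, housing["income_cat"]))
sss_train = housing.iloc[train_index]
sss_test = housing.iloc[test_index]

# Approach 2: train_test_split with the stratify parameter.
tts_train, tts_test = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"])

# Compare category proportions in each test set with the full dataset.
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
sss_props = sss_test["income_cat"].value_counts(normalize=True).sort_index()
tts_props = tts_test["income_cat"].value_counts(normalize=True).sort_index()

comparison = pd.DataFrame({
    "overall": overall,
    "StratifiedShuffleSplit": sss_props,
    "train_test_split": tts_props,
})
print(comparison)
```

In this sketch the income_cat column only defines the strata; the stratified sampling itself happens inside whichever splitter receives that column.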
Thanks for the awesome book!
Hi @qinhanmin2014, great suggestion, thanks a lot! Apparently, this parameter was added in Scikit-Learn 0.18. It wasn't there when I started writing the book, and I wasn't aware of it. As soon as I have a minute, I'll update the book and the notebooks! Thanks again
Strongly agree with this. When I read the book, the StratifiedShuffleSplit code confused me. As I understood it, the stratification was already done by generating income_cat, so I wasn't sure what StratifiedShuffleSplit did beyond that. After switching to train_test_split, I think it's clear. Thanks!
Hello, I was following the code mentioned in the book but it isn't working and gives the following error:
TypeError Traceback (most recent call last)
<ipython-input-24-9730d791daa8> in <module>
2
3 split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
----> 4 for train_index, test_index in split.split(housing, housing["income_cat"]):
5 strat_train_set = housing.loc[train_index]
6 strat_test_set = housing.loc[test_index]
~\Anaconda3_2\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
1771 to an integer.
1772 """
-> 1773 y = check_array(y, ensure_2d=False, dtype=None)
1774 return super(StratifiedShuffleSplit, self).split(X, y, groups)
1775
~\Anaconda3_2\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
478 # DataFrame), and store them. If not, store None.
479 dtypes_orig = None
--> 480 if hasattr(array, "dtypes") and len(array.dtypes):
481 dtypes_orig = np.array(array.dtypes)
482
TypeError: object of type 'CategoricalDtype' has no len()
I also tried the code mentioned here (without StratifiedShuffleSplit), but it gave the same error. Please help me fix this. Thank you so much!
@raool8 This is a bug in scikit-learn 0.20.0 and 0.20.1 (resolved in https://github.com/scikit-learn/scikit-learn/pull/12706), please update your scikit-learn.
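A quick way to check whether an installed version is one of the two releases named above is sketched below. The helper names (parse_version, is_affected) are hypothetical, and the set of affected versions is taken only from this thread, so treat it as a rough check rather than an authoritative one.

```python
import re

# Releases named in this thread as affected by the CategoricalDtype TypeError.
AFFECTED_VERSIONS = {(0, 20, 0), (0, 20, 1)}


def parse_version(version):
    """Return the leading numeric (major, minor, patch) parts of a version string.

    Pre-release or development suffixes (for example "rc1") are ignored,
    and a missing patch number is treated as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_affected(version):
    """Return True if the version is one of the releases listed above."""
    return parse_version(version) in AFFECTED_VERSIONS


for candidate in ["0.19.2", "0.20.0", "0.20.1", "0.20.2"]:
    print(candidate, is_affected(candidate))
```

To check a local installation, the same helper can be called as is_affected(sklearn.__version__) after import sklearn. If it returns True, upgrading scikit-learn (for example with pip install -U scikit-learn, or conda update scikit-learn in an Anaconda environment) should pick up a release that includes the fix.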
thanks a lot