handson-ml CH02: Use train_test_split instead of StratifiedShuffleSplit

In CH02, the book uses StratifiedShuffleSplit to split the data according to income category. It might be more user-friendly to use train_test_split with its stratify parameter instead. Code from master:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

updated version:

from sklearn.model_selection import train_test_split

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"])
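For anyone who wants to check that the two versions behave alike, here is a minimal sketch on synthetic data. The `housing` frame below is made up for illustration (it is not the book's dataset), and `income_cat` is built with a simple ceil-and-cap rule that is only an assumption standing in for the book's column. Both approaches should keep the test set's category proportions close to the full dataset's.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic stand-in for the housing data (assumption, not the real dataset).
rng = np.random.RandomState(0)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})
# Integer income categories 1..5 (illustrative binning rule).
housing["income_cat"] = (
    np.ceil(housing["median_income"] / 1.5).clip(upper=5).astype(int)
)

# Approach from the book: StratifiedShuffleSplit with a single split.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(housing, housing["income_cat"]))
loop_train, loop_test = housing.iloc[train_index], housing.iloc[test_index]

# Suggested approach: train_test_split with the stratify parameter.
tts_train, tts_test = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"]
)

# Compare category proportions in each test set against the full dataset.
overall = housing["income_cat"].value_counts(normalize=True).sort_index()
loop_props = loop_test["income_cat"].value_counts(normalize=True).sort_index()
tts_props = tts_test["income_cat"].value_counts(normalize=True).sort_index()
max_gap = max(
    (loop_props - overall).abs().max(),
    (tts_props - overall).abs().max(),
)
print(pd.DataFrame({"overall": overall, "loop": loop_props, "tts": tts_props}))
```

The printed table puts the three proportion columns side by side, which makes it easy to see that neither approach distorts the category mix.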

Thanks for the awesome book!

Mar 22 '19 14:03 qinhanmin2014

Hi @qinhanmin2014, great suggestion, thanks a lot! Apparently this parameter was added in Scikit-Learn 0.18. It wasn't there when I started writing the book, and I wasn't aware of it. As soon as I have a minute, I'll update the book and the notebooks! Thanks again.

Mar 23 '19 07:03 ageron

Strongly agree with this. When I read the book, the StratifiedShuffleSplit code confused me. My understanding was that the stratified sampling had already been done by generating income_cat, and I wasn't sure whether StratifiedShuffleSplit did anything further. After using train_test_split instead, I think it's clear. Thanks!
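On the point of confusion above, a small sketch may help: creating income_cat only labels each row, and the split is what actually uses those labels. The data below is synthetic and the binning rule is an assumption for illustration, not the book's exact code. A plain random split ignores the labels, so its category proportions will typically drift further from the full dataset's than the stratified split's do.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption, not the real dataset).
rng = np.random.RandomState(1)
housing = pd.DataFrame({"median_income": rng.uniform(0.5, 15.0, size=1000)})
# This step only labels rows; no sampling happens here.
housing["income_cat"] = (
    np.ceil(housing["median_income"] / 1.5).clip(upper=5).astype(int)
)

# Random split: the income_cat labels are not used at all.
_, random_test = train_test_split(housing, test_size=0.2, random_state=42)

# Stratified split: the labels drive how rows are sampled.
_, strat_test = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"]
)

def max_proportion_gap(test_set):
    """Largest absolute gap between test-set and overall category shares."""
    overall = housing["income_cat"].value_counts(normalize=True)
    in_test = test_set["income_cat"].value_counts(normalize=True)
    # reindex guards against a category missing from the test set.
    return (in_test.reindex(overall.index, fill_value=0) - overall).abs().max()

random_gap = max_proportion_gap(random_test)
strat_gap = max_proportion_gap(strat_test)
print(f"random split gap:     {random_gap:.4f}")
print(f"stratified split gap: {strat_gap:.4f}")
```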

Mar 24 '19 04:03 wangyoucao577

Hello, I was following the code mentioned in the book but it isn't working and gives the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-9730d791daa8> in <module>
      2 
      3 split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
----> 4 for train_index, test_index in split.split(housing, housing["income_cat"]):
      5     strat_train_set = housing.loc[train_index]
      6     strat_test_set = housing.loc[test_index]

~\Anaconda3_2\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
   1771         to an integer.
   1772         """
-> 1773         y = check_array(y, ensure_2d=False, dtype=None)
   1774         return super(StratifiedShuffleSplit, self).split(X, y, groups)
   1775 

~\Anaconda3_2\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    478     # DataFrame), and store them. If not, store None.
    479     dtypes_orig = None
--> 480     if hasattr(array, "dtypes") and len(array.dtypes):
    481         dtypes_orig = np.array(array.dtypes)
    482 

TypeError: object of type 'CategoricalDtype' has no len()

I also tried the code mentioned here (without StratifiedShuffleSplit), but it gave the same error. Please help me fix this. Thank you so much!

Mar 28 '19 17:03 raool8

@raool8 This is a bug in scikit-learn 0.20.0 and 0.20.1 (resolved in https://github.com/scikit-learn/scikit-learn/pull/12706). Please update your scikit-learn.
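To check whether an installation is one of the two versions named above, something like the following sketch could be used. The helper `is_affected_version` is hypothetical (it is not part of scikit-learn), and it only looks at the numeric prefix of the version string, so pre-release strings are matched by their leading numbers.

```python
import re

import sklearn

# Versions reported as affected in this thread.
AFFECTED_VERSIONS = {(0, 20, 0), (0, 20, 1)}

def is_affected_version(version):
    """Return True if the version string's numeric prefix is 0.20.0 or 0.20.1.

    Hypothetical helper for illustration; a missing patch number counts as 0.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        return False
    major, minor, patch = (
        int(part) if part is not None else 0 for part in match.groups()
    )
    return (major, minor, patch) in AFFECTED_VERSIONS

if is_affected_version(sklearn.__version__):
    print(f"scikit-learn {sklearn.__version__} is affected; please upgrade.")
else:
    print(f"scikit-learn {sklearn.__version__} is not one of the affected versions.")
```

Upgrading is usually done with `pip install -U scikit-learn` or `conda update scikit-learn`, depending on how the package was installed.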

Mar 29 '19 01:03 qinhanmin2014

Thanks a lot!

Apr 12 '20 15:04 mr-arka-mardin