CH02: Use train_test_split instead of StratifiedShuffleSplit

Open qinhanmin2014 opened this issue 6 years ago • 5 comments

In CH02, the book uses StratifiedShuffleSplit to split the data according to income category. It might be more user-friendly to use train_test_split instead. Code from master:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

updated version:

from sklearn.model_selection import train_test_split

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"])

Thanks for the awesome book!

qinhanmin2014, Mar 22 '19 14:03
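
To make the comparison above concrete, here is a minimal sketch that runs both variants side by side. The toy DataFrame is an assumption standing in for the book's housing data (only the column names are borrowed from the thread), and `.iloc` is used because `split()` yields positional indices, which only coincide with `.loc` labels when the index is the default range.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Toy stand-in for the housing DataFrame (an assumption, not the book's data):
# 1,000 rows with a deliberately imbalanced income_cat column.
housing = pd.DataFrame({
    "median_income": np.linspace(0.5, 15.0, 1000),
    "income_cat": np.repeat([1, 2, 3, 4, 5], [50, 300, 350, 200, 100]),
})

# Original approach: take the single split produced by StratifiedShuffleSplit.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(housing, housing["income_cat"]))
sss_train, sss_test = housing.iloc[train_index], housing.iloc[test_index]

# Suggested approach: train_test_split with the stratify argument.
tts_train, tts_test = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"])

# Compare category proportions in the full data and in each test set.
comparison = pd.DataFrame({
    "overall": housing["income_cat"].value_counts(normalize=True),
    "sss_test": sss_test["income_cat"].value_counts(normalize=True),
    "tts_test": tts_test["income_cat"].value_counts(normalize=True),
}).sort_index()
print(comparison)
```

Both test sets should track the overall category proportions closely; the one-call version simply removes the loop and the index bookkeeping.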

Hi @qinhanmin2014, great suggestion, thanks a lot! Apparently this parameter was added in Scikit-Learn 0.18; it wasn't there when I started writing the book, and I wasn't aware of it. As soon as I have a minute, I'll update the book and the notebooks! Thanks again.

ageron, Mar 23 '19 07:03

Strongly agree with this. When I read the book, the StratifiedShuffleSplit code confused me. As I understood it, the stratified sampling had already been done by generating income_cat, and I wasn't sure whether StratifiedShuffleSplit did anything further. After using train_test_split instead, I think it's clear. Thanks!

wangyoucao577, Mar 24 '19 04:03
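
On the question of what the splitter adds beyond income_cat: creating the column only labels each row with a stratum, and the sampling itself happens when that column is handed to the splitter. The sketch below illustrates this under stated assumptions: the data is a toy stand-in and `max_proportion_error` is a hypothetical helper written for this illustration, not something from the book.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (an assumption): the income_cat column exists in both experiments below.
housing = pd.DataFrame({
    "income_cat": np.repeat([1, 2, 3, 4, 5], [50, 300, 350, 200, 100]),
})
overall = housing["income_cat"].value_counts(normalize=True).sort_index()

# Purely random split: income_cat is present but the splitter never looks at it.
_, random_test = train_test_split(housing, test_size=0.2, random_state=42)

# Stratified split: income_cat is passed to the splitter via stratify=.
_, strat_test = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=housing["income_cat"])

def max_proportion_error(test_set):
    """Largest absolute gap between test-set and overall category shares."""
    shares = test_set["income_cat"].value_counts(normalize=True)
    shares = shares.reindex(overall.index, fill_value=0.0)
    return float((shares - overall).abs().max())

random_err = max_proportion_error(random_test)
strat_err = max_proportion_error(strat_test)
print(f"random split:     {random_err:.4f}")
print(f"stratified split: {strat_err:.4f}")
```

The stratified error should be at or near zero, while the random split is free to drift from the overall proportions.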

Hello, I was following the code mentioned in the book but it isn't working and gives the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-24-9730d791daa8> in <module>
      2 
      3 split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
----> 4 for train_index, test_index in split.split(housing, housing["income_cat"]):
      5     strat_train_set = housing.loc[train_index]
      6     strat_test_set = housing.loc[test_index]

~\Anaconda3_2\lib\site-packages\sklearn\model_selection\_split.py in split(self, X, y, groups)
   1771         to an integer.
   1772         """
-> 1773         y = check_array(y, ensure_2d=False, dtype=None)
   1774         return super(StratifiedShuffleSplit, self).split(X, y, groups)
   1775 

~\Anaconda3_2\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    478     # DataFrame), and store them. If not, store None.
    479     dtypes_orig = None
--> 480     if hasattr(array, "dtypes") and len(array.dtypes):
    481         dtypes_orig = np.array(array.dtypes)
    482 

TypeError: object of type 'CategoricalDtype' has no len()

I also tried the code mentioned here (without the StratifiedShuffleSplit), but it gave the same error. Please help me fix this. Thank you so much!

raool8, Mar 28 '19 17:03

@raool8 This is a bug in scikit-learn 0.20.0 and 0.20.1 (resolved in https://github.com/scikit-learn/scikit-learn/pull/12706). Please update your scikit-learn.

qinhanmin2014, Mar 29 '19 01:03
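
For anyone who cannot upgrade right away, a possible stopgap is sketched below. It first prints the installed version, then stratifies on plain integer labels rather than the categorical Series that appears in the traceback. This is an assumption based on the error message, not something verified against the affected 0.20.0 and 0.20.1 releases; the toy data and the `strat_labels` name are hypothetical, and upgrading remains the actual fix.

```python
import numpy as np
import pandas as pd
import sklearn
from sklearn.model_selection import train_test_split

# Check which release is installed before anything else.
print("scikit-learn version:", sklearn.__version__)

# Toy data (an assumption) with a categorical income_cat column, as pd.cut produces.
housing = pd.DataFrame({"median_income": np.linspace(0.5, 8.0, 500)})
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)

# Hypothetical workaround: stratify on integer labels so the splitter
# never receives a Series with a CategoricalDtype.
strat_labels = housing["income_cat"].astype(int)

strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, random_state=42, stratify=strat_labels)
print(len(strat_train_set), len(strat_test_set))
```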

Thanks a lot.

mr-arka-mardin, Apr 12 '20 15:04