mljar-supervised

Splitting train/test has off-by-one error

Open offchan42 opened this issue 2 years ago • 4 comments

I got this error:

The sum of train_size and test_size = 55757, should be smaller than the number of samples 55756. Reduce test_size and/or train_size.
Traceback (most recent call last):
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\base_automl.py", line 1084, in _fit
    trained = self.train_model(params)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\base_automl.py", line 371, in train_model
    mf.train(results_path, model_subpath)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\model_framework.py", line 165, in train
    train_data, validation_data = self.validation.get_split(k_fold, repeat)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\validation\validation_step.py", line 30, in get_split
    return self.validator.get_split(k, repeat)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\validation\validator_split.py", line 76, in get_split
    X_train, X_validation, y_train, y_validation = train_test_split(
  File "C:\Users\off99\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2175, in train_test_split
    n_train, n_test = _validate_shuffle_split(n_samples, test_size, train_size,
  File "C:\Users\off99\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 1849, in _validate_shuffle_split
    raise ValueError('The sum of train_size and test_size = %d, '
ValueError: The sum of train_size and test_size = 55757, should be smaller than the number of samples 55756. Reduce test_size and/or train_size.

Here's the code that splits the dataset: (image)

This is the code that is used for training:

from supervised.automl import AutoML
automl = AutoML(
    total_time_limit=3600,
    mode='Perform',
    ml_task='binary_classification',
    eval_metric='auc',
    max_single_prediction_time=None,
    golden_features=False,
    kmeans_features=False,
    train_ensemble=True,
    algorithms=[
        # 'Baseline',
        # 'Linear',
        # 'Decision Tree',
        'Random Forest',
        'Extra Trees',
        'LightGBM',
        'Xgboost',
        'CatBoost',
        'Neural Network'
    ],
    validation_strategy={
        "validation_type": "split",
        "train_ratio": train_ratio,
        "shuffle": False,
        "stratify": False
    },
)
automl.fit(X, y)

What is the cause of this off-by-one error? How do I fix it? It seems mljar probably interpreted the ratio wrongly or rounded the number wrongly somehow. I just wanted to feed my own validation set to the training process (no CV, just a simple cut-in-the-middle split).

offchan42 avatar Oct 27 '21 12:10 offchan42

@off99555 maybe try setting it to the exact number of samples, e.g. train_ratio=37171. I think it should work and use the first 37171 samples from X. Please let me know if it works for you.

pplonski avatar Oct 27 '21 12:10 pplonski

I set train_ratio=len(X_train) and here's the new error:

test_size=-37170.0 should be either positive and smaller than the number of samples 55756 or a float in the (0, 1) range
Traceback (most recent call last):
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\base_automl.py", line 1084, in _fit
    trained = self.train_model(params)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\base_automl.py", line 371, in train_model
    mf.train(results_path, model_subpath)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\model_framework.py", line 165, in train
    train_data, validation_data = self.validation.get_split(k_fold, repeat)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\validation\validation_step.py", line 30, in get_split
    return self.validator.get_split(k, repeat)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\validation\validator_split.py", line 76, in get_split
    X_train, X_validation, y_train, y_validation = train_test_split(
  File "C:\Users\off99\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2175, in train_test_split
    n_train, n_test = _validate_shuffle_split(n_samples, test_size, train_size,
  File "C:\Users\off99\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 1811, in _validate_shuffle_split
    raise ValueError('test_size={0} should be either positive and smaller'
ValueError: test_size=-37170.0 should be either positive and smaller than the number of samples 55756 or a float in the (0, 1) range

Why is it saying test size is -37170 though? It should be 18585.
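
One reading that fits the number in the message: if the validator passes the complement of `train_ratio` as `test_size` (an inference from the error text, not something confirmed in this thread), then an integer sample count gives exactly the reported value. A minimal sketch, where `implied_test_size` is a hypothetical helper and not part of mljar-supervised:

```python
def implied_test_size(train_ratio):
    # Assumption: test_size is derived as 1.0 - train_ratio,
    # which only makes sense when train_ratio is a fraction in (0, 1).
    return 1.0 - train_ratio


# With the integer sample count from the comment above:
print(implied_test_size(37171))  # -37170.0
```

Under that assumption, an absolute sample count cannot be used for `train_ratio`, because the complement is only meaningful for fractions.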

offchan42 avatar Oct 27 '21 13:10 offchan42

Please try setting train_ratio=0.665, i.e. manually set a slightly lower value.

It looks like there is a bug in the code that sets train_size and test_size: https://github.com/mljar/mljar-supervised/blob/f695fe5cad7fd075c6d7e2a72e9b8f8f18ddb1f2/supervised/validation/validator_split.py#L79-L80
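
To illustrate how passing both a fractional train_size and its complement as test_size can overshoot by one sample, here is a small sketch in plain Python. It assumes scikit-learn rounds a fractional train size down and a fractional test size up, and that test_size is computed as `1.0 - train_ratio`; both are assumptions for illustration. The helper names `sklearn_like_sizes` and `complement_sizes` are hypothetical.

```python
import math


def sklearn_like_sizes(n_samples, train_ratio):
    """Assumed rounding: floor for the train part, ceil for the test part."""
    test_ratio = 1.0 - train_ratio  # the complement carries float error
    n_train = math.floor(train_ratio * n_samples)
    n_test = math.ceil(test_ratio * n_samples)
    return n_train, n_test


def complement_sizes(n_samples, train_ratio):
    """Possible alternative: round once, give the remainder to the test part."""
    n_train = math.floor(train_ratio * n_samples)
    return n_train, n_samples - n_train


# Small case where ratio * n_samples is a whole number:
# 1.0 - 0.7 is slightly above 0.3, so the ceil rounds 3.0000000000000004 up to 4.
print(sklearn_like_sizes(10, 0.7))       # (7, 4), one more than 10 samples
print(complement_sizes(10, 0.7))         # (7, 3)

# The suggested train_ratio=0.665 lands well away from a whole number,
# so floor and ceil add up to exactly the dataset size.
print(sklearn_like_sizes(55756, 0.665))  # (37077, 18679)
```

If these assumptions hold, the overshoot can only occur when `train_ratio * n_samples` is (nearly) a whole number, which is exactly the case when the ratio is derived from an existing split such as `len(X_train) / len(X)`.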

pplonski avatar Oct 27 '21 13:10 pplonski

Thanks. I will use that workaround for now!

offchan42 avatar Oct 27 '21 14:10 offchan42