mljar-supervised
Splitting train/test has off-by-one error
I got this error:
```
The sum of train_size and test_size = 55757, should be smaller than the number of samples 55756. Reduce test_size and/or train_size.

Traceback (most recent call last):
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\base_automl.py", line 1084, in _fit
    trained = self.train_model(params)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\base_automl.py", line 371, in train_model
    mf.train(results_path, model_subpath)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\model_framework.py", line 165, in train
    train_data, validation_data = self.validation.get_split(k_fold, repeat)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\validation\validation_step.py", line 30, in get_split
    return self.validator.get_split(k, repeat)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\validation\validator_split.py", line 76, in get_split
    X_train, X_validation, y_train, y_validation = train_test_split(
  File "C:\Users\off99\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2175, in train_test_split
    n_train, n_test = _validate_shuffle_split(n_samples, test_size, train_size,
  File "C:\Users\off99\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 1849, in _validate_shuffle_split
    raise ValueError('The sum of train_size and test_size = %d, '
ValueError: The sum of train_size and test_size = 55757, should be smaller than the number of samples 55756. Reduce test_size and/or train_size.
```
Here's the code used for training (the train/validation split itself happens inside mljar):
```python
from supervised.automl import AutoML

automl = AutoML(
    total_time_limit=3600,
    mode='Perform',
    ml_task='binary_classification',
    eval_metric='auc',
    max_single_prediction_time=None,
    golden_features=False,
    kmeans_features=False,
    train_ensemble=True,
    algorithms=[
        # 'Baseline',
        # 'Linear',
        # 'Decision Tree',
        'Random Forest',
        'Extra Trees',
        'LightGBM',
        'Xgboost',
        'CatBoost',
        'Neural Network',
    ],
    validation_strategy={
        "validation_type": "split",
        "train_ratio": train_ratio,  # defined earlier in my script
        "shuffle": False,
        "stratify": False,
    },
)
automl.fit(X, y)
```
What is the cause of this off-by-one error, and how do I fix it? It seems mljar either interpreted the ratio incorrectly or rounded the sample counts incorrectly somewhere. I just want to feed my own validation set to the training process (no CV, just a simple cut-in-the-middle split).
@off99555 maybe try setting it to the exact number of samples, like train_ratio=37171 - I think it should work and use the first 37171 samples from X. Please let me know if it works for you.
I set train_ratio=len(X_train), and here's the new error:
```
test_size=-37170.0 should be either positive and smaller than the number of samples 55756 or a float in the (0, 1) range

Traceback (most recent call last):
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\base_automl.py", line 1084, in _fit
    trained = self.train_model(params)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\base_automl.py", line 371, in train_model
    mf.train(results_path, model_subpath)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\model_framework.py", line 165, in train
    train_data, validation_data = self.validation.get_split(k_fold, repeat)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\validation\validation_step.py", line 30, in get_split
    return self.validator.get_split(k, repeat)
  File "C:\Users\off99\anaconda3\lib\site-packages\supervised\validation\validator_split.py", line 76, in get_split
    X_train, X_validation, y_train, y_validation = train_test_split(
  File "C:\Users\off99\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 2175, in train_test_split
    n_train, n_test = _validate_shuffle_split(n_samples, test_size, train_size,
  File "C:\Users\off99\anaconda3\lib\site-packages\sklearn\model_selection\_split.py", line 1811, in _validate_shuffle_split
    raise ValueError('test_size={0} should be either positive and smaller'
ValueError: test_size=-37170.0 should be either positive and smaller than the number of samples 55756 or a float in the (0, 1) range
```
Why is it saying test_size is -37170, though? It should be 18585.
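A negative test_size is what you would get if the validator computes the test portion as 1 - train_ratio on the raw value, regardless of whether a fraction or a sample count was passed (this is an assumption about the split-validator code, consistent with the error message):

```python
train_ratio = 37171          # an integer sample count instead of a fraction

# If the complement is computed as though train_ratio were a fraction:
test_size = 1 - train_ratio
print(test_size)             # -37170, matching the -37170.0 in the error
```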
Please try to set train_ratio=0.665 - manually set a lower value. It looks like there is a bug in the code that sets train_size and test_size:
https://github.com/mljar/mljar-supervised/blob/f695fe5cad7fd075c6d7e2a72e9b8f8f18ddb1f2/supervised/validation/validator_split.py#L79-L80
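A plausible mechanism for the original off-by-one, assuming those lines pass both train_size=train_ratio and test_size=1 - train_ratio to scikit-learn's train_test_split: for float sizes, sklearn's _validate_shuffle_split floors the train fraction but ceils the test fraction, and 1 - train_ratio is computed in floating point, so the two counts can sum to n_samples + 1. A minimal sketch of the arithmetic (0.7 and n_samples=10 are illustrative values, not the ones from the report):

```python
from math import ceil, floor

n_samples = 10
train_ratio = 0.7
test_size = 1 - train_ratio               # 0.30000000000000004, slightly above 0.3

# Mirrors sklearn's rounding for float-valued sizes:
n_train = floor(train_ratio * n_samples)  # floor(7.0) = 7
n_test = ceil(test_size * n_samples)      # ceil(3.0000000000000004) = 4

print(n_train + n_test)                   # 11 > 10, so sklearn raises ValueError
```

Passing only train_size and leaving test_size=None would let scikit-learn derive the test count as the complement of the train count, avoiding the double rounding; manually lowering the ratio, as suggested above, sidesteps the boundary case in the same way.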
Thanks. I will use that workaround for now!