automlbenchmark
Issue with OpenML tasks openml.org/t/359947 & openml.org/t/360115
Error with TPOT, Autosklearn, RandomForest (and other sklearn-based frameworks):
[ERROR] [amlb.benchmark:01:28:09.198] PyOpenML cannot handle string when returning numpy arrays. Use dataset_format="dataframe".
Traceback (most recent call last):
File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/openml/datasets/dataset.py", line 623, in _convert_array_format
return np.asarray(data, dtype=np.float32)
File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
return array(a, dtype, copy=False, order=order)
File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/pandas/core/generic.py", line 1899, in __array__
return np.asarray(self._values, dtype=dtype)
File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '30n20b8'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/seb/repos/ml/automlbenchmark/amlb/benchmark.py", line 512, in run
meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
File "/Users/seb/repos/ml/automlbenchmark/frameworks/TPOT/__init__.py", line 14, in run
X_train, X_test = impute_array(dataset.train.X_enc, dataset.test.X_enc)
File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 75, in decorator
return cache(self, prop_name, prop_fn)
File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 33, in cache
value = fn(self)
File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/process.py", line 702, in profiler
return fn(*args, **kwargs)
File "/Users/seb/repos/ml/automlbenchmark/amlb/data.py", line 146, in X_enc
return self.data_enc[:, predictors_ind]
File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 75, in decorator
return cache(self, prop_name, prop_fn)
File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 33, in cache
value = fn(self)
File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/process.py", line 702, in profiler
return fn(*args, **kwargs)
File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 168, in data_enc
return self._get_data('array')
File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 172, in _get_data
self.dataset._load_data(fmt)
File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 129, in _load_data
train, test = splitter.split()
File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/process.py", line 702, in profiler
return fn(*args, **kwargs)
File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 204, in split
X = self.ds._load_full_data('array')
File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 134, in _load_full_data
X, *_ = self._oml_dataset.get_data(dataset_format=fmt)
File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/openml/datasets/dataset.py", line 726, in get_data
data = self._convert_array_format(data, dataset_format, attribute_names)
File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/openml/datasets/dataset.py", line 626, in _convert_array_format
"PyOpenML cannot handle string when returning numpy"
openml.exceptions.PyOpenMLError: PyOpenML cannot handle string when returning numpy arrays. Use dataset_format="dataframe".
- Dataset https://www.openml.org/d/41702 doesn't exclude columns as expected:
<!-- description.xml -->
<oml:ignore_attribute>instance_idrepetitionrunstatus</oml:ignore_attribute>
<!-- features.xml -->
<oml:feature>
<oml:index>0</oml:index>
<oml:name>instance_id</oml:name>
<oml:data_type>string</oml:data_type>
<oml:is_target>false</oml:is_target>
<oml:is_ignore>false</oml:is_ignore> <!-- true expected -->
<oml:is_row_identifier>false</oml:is_row_identifier> <!-- true expected -->
<oml:number_of_missing_values>0</oml:number_of_missing_values>
</oml:feature>
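A quick way to spot this kind of feature is to scan features.xml for string-typed entries that are flagged neither as ignored nor as row identifier. The following is only a minimal sketch using the standard library: the wrapping `oml:data_features` element, the namespace URI, the second feature and the helper name `unexcluded_string_features` are assumptions made for illustration, and tags are matched by local name so the check does not depend on the exact namespace.

```python
import xml.etree.ElementTree as ET

# Sample input modelled on the snippet above. The root element, the namespace
# declaration and the second feature are assumptions added so the XML parses.
FEATURES_XML = """
<oml:data_features xmlns:oml="http://openml.org/openml">
  <oml:feature>
    <oml:index>0</oml:index>
    <oml:name>instance_id</oml:name>
    <oml:data_type>string</oml:data_type>
    <oml:is_target>false</oml:is_target>
    <oml:is_ignore>false</oml:is_ignore>
    <oml:is_row_identifier>false</oml:is_row_identifier>
    <oml:number_of_missing_values>0</oml:number_of_missing_values>
  </oml:feature>
  <oml:feature>
    <oml:index>1</oml:index>
    <oml:name>hypothetical_numeric_feature</oml:name>
    <oml:data_type>numeric</oml:data_type>
    <oml:is_target>false</oml:is_target>
    <oml:is_ignore>false</oml:is_ignore>
    <oml:is_row_identifier>false</oml:is_row_identifier>
    <oml:number_of_missing_values>0</oml:number_of_missing_values>
  </oml:feature>
</oml:data_features>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def unexcluded_string_features(xml_text):
    """Return names of string features not flagged as ignore or row identifier."""
    root = ET.fromstring(xml_text.strip())
    names = []
    for element in root.iter():
        if local_name(element.tag) != "feature":
            continue
        fields = {local_name(c.tag): (c.text or "").strip() for c in element}
        excluded = (
            fields.get("is_ignore") == "true"
            or fields.get("is_row_identifier") == "true"
        )
        if fields.get("data_type") == "string" and not excluded:
            names.append(fields.get("name"))
    return names


print(unexcluded_string_features(FEATURES_XML))  # ['instance_id']
```

Any name returned here is a column that would still reach the numpy conversion and trigger the error above.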
- Dataset https://www.openml.org/d/42759 has many string columns at the end of its 15k features (starting at column index 14740), with no exclusions. Looking at the data, these look like digested/encoded strings. They are actually typed as STRING in the ARFF file, although the description says they are categorical:
% The training set contains 50,000 examples. The first predictive 14,740 variables are numerical and the last 260 predictive variables are categorical. The last target variable is binary (-1,1).
What should we do? Remove the entire dataset (it's one of the few relatively large ones: 50k rows x 15k features)? Exclude the string columns? Keep it, given that several frameworks are able to handle it? Note that we won't have any reference metric (RF, tunedRF) for this dataset.
- Ideally, maybe PyOpenML should automatically ignore the features it can't handle in numpy format and emit a warning.
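To make the last suggestion concrete, here is a rough sketch of what "ignore and warn" could look like when applied to data loaded with dataset_format="dataframe". It is not PyOpenML's behavior or API: the helper `to_numeric_array` and the toy frame are hypothetical, and encoding categoricals as integer codes is an assumption about what the array format should contain.

```python
import warnings

import numpy as np
import pandas as pd


def to_numeric_array(df):
    """Return (float32 array, dropped column names).

    Numeric and boolean columns are cast to float32, categorical columns are
    replaced by their integer codes, and anything else (e.g. raw strings) is
    dropped with a warning instead of raising.
    """
    kept = {}
    dropped = []
    for name in df.columns:
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            codes = col.cat.codes.astype("float32")
            # pandas uses code -1 for missing categories; map it to NaN.
            kept[name] = codes.where(codes >= 0, np.nan)
        elif pd.api.types.is_numeric_dtype(col):
            kept[name] = col.astype("float32")
        else:
            dropped.append(name)
    if dropped:
        warnings.warn(
            f"Ignoring {len(dropped)} column(s) that cannot be converted "
            f"to a numeric array: {dropped}"
        )
    array = pd.DataFrame(kept, index=df.index).to_numpy(dtype=np.float32)
    return array, dropped


# Toy data for illustration only; '30n20b8' is the value from the traceback,
# the remaining values and column names are made up.
toy = pd.DataFrame(
    {
        "x": [1.0, 2.5],
        "instance_id": ["30n20b8", "another_id"],
        "c": pd.Categorical(["a", "b"]),
    }
)
X, dropped = to_numeric_array(toy)
print(X.shape)   # (2, 2)
print(dropped)   # ['instance_id']
```

With something along these lines, sklearn-based frameworks would still get a usable array for datasets such as 41702 and 42759, at the cost of silently losing information if the warning goes unnoticed.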