automlbenchmark issue with OpenML tasks openml.org/t/359947 & openml.org/t/360115

Open sebhrusen opened this issue 3 years ago • 0 comments

Error with TPOT, Autosklearn, RandomForest (and other sklearn-based frameworks):

[ERROR] [amlb.benchmark:01:28:09.198] PyOpenML cannot handle string when returning numpy arrays. Use dataset_format="dataframe".
Traceback (most recent call last):
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/openml/datasets/dataset.py", line 623, in _convert_array_format
    return np.asarray(data, dtype=np.float32)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/pandas/core/generic.py", line 1899, in __array__
    return np.asarray(self._values, dtype=dtype)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '30n20b8'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/seb/repos/ml/automlbenchmark/amlb/benchmark.py", line 512, in run
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
  File "/Users/seb/repos/ml/automlbenchmark/frameworks/TPOT/__init__.py", line 14, in run
    X_train, X_test = impute_array(dataset.train.X_enc, dataset.test.X_enc)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 75, in decorator
    return cache(self, prop_name, prop_fn)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 33, in cache
    value = fn(self)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/process.py", line 702, in profiler
    return fn(*args, **kwargs)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/data.py", line 146, in X_enc
    return self.data_enc[:, predictors_ind]
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 75, in decorator
    return cache(self, prop_name, prop_fn)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 33, in cache
    value = fn(self)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/process.py", line 702, in profiler
    return fn(*args, **kwargs)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 168, in data_enc
    return self._get_data('array')
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 172, in _get_data
    self.dataset._load_data(fmt)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 129, in _load_data
    train, test = splitter.split()
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/process.py", line 702, in profiler
    return fn(*args, **kwargs)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 204, in split
    X = self.ds._load_full_data('array')
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 134, in _load_full_data
    X, *_ = self._oml_dataset.get_data(dataset_format=fmt)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/openml/datasets/dataset.py", line 726, in get_data
    data = self._convert_array_format(data, dataset_format, attribute_names)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/openml/datasets/dataset.py", line 626, in _convert_array_format
    "PyOpenML cannot handle string when returning numpy"
openml.exceptions.PyOpenMLError: PyOpenML cannot handle string when returning numpy arrays. Use dataset_format="dataframe".

Dataset https://www.openml.org/d/41702 doesn't exclude columns as expected:

<!-- description.xml -->
<oml:ignore_attribute>instance_idrepetitionrunstatus</oml:ignore_attribute>

<!-- features.xml -->
<oml:feature>
  <oml:index>0</oml:index>
  <oml:name>instance_id</oml:name>
  <oml:data_type>string</oml:data_type>
  <oml:is_target>false</oml:is_target>
  <oml:is_ignore>false</oml:is_ignore>   <!-- true expected -->
  <oml:is_row_identifier>false</oml:is_row_identifier> <!-- true expected -->
  <oml:number_of_missing_values>0</oml:number_of_missing_values>
</oml:feature>
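
A minimal sketch of how the mismatch above could be checked, assuming the two XML documents are available as strings. The root element names, the `http://openml.org/openml` namespace URI, and both helper functions are assumptions made for illustration; the `ignore_attribute` value is taken exactly as it appears in the excerpt above.

```python
import xml.etree.ElementTree as ET

# Assumed namespace mapping for the oml: prefix.
OML = {"oml": "http://openml.org/openml"}

# Trimmed-down copies of the excerpts above, wrapped in assumed root elements.
DESCRIPTION_XML = """<oml:data_set_description xmlns:oml="http://openml.org/openml">
  <oml:ignore_attribute>instance_idrepetitionrunstatus</oml:ignore_attribute>
</oml:data_set_description>"""

FEATURES_XML = """<oml:data_features xmlns:oml="http://openml.org/openml">
  <oml:feature>
    <oml:index>0</oml:index>
    <oml:name>instance_id</oml:name>
    <oml:data_type>string</oml:data_type>
    <oml:is_ignore>false</oml:is_ignore>
  </oml:feature>
</oml:data_features>"""


def unmatched_ignore_attributes(description_xml, features_xml):
    """Return declared ignore_attribute values that match no feature name."""
    desc = ET.fromstring(description_xml)
    feats = ET.fromstring(features_xml)
    feature_names = {el.text for el in feats.findall(".//oml:name", OML)}
    declared = [el.text for el in desc.findall(".//oml:ignore_attribute", OML)]
    return [name for name in declared if name not in feature_names]


def string_features_not_ignored(features_xml):
    """Return names of string-typed features whose is_ignore flag is not true."""
    feats = ET.fromstring(features_xml)
    names = []
    for feature in feats.findall("oml:feature", OML):
        data_type = feature.findtext("oml:data_type", default="", namespaces=OML)
        is_ignore = feature.findtext("oml:is_ignore", default="false", namespaces=OML)
        if data_type == "string" and is_ignore.strip().lower() != "true":
            names.append(feature.findtext("oml:name", namespaces=OML))
    return names
```

With the excerpts as given, the first helper reports the declared value as unmatched and the second reports `instance_id` as a string feature that is still exposed, which is consistent with the numpy conversion failing on it.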

Dataset https://www.openml.org/d/42759 has many string columns at the end of its 15k features (starting at column index 14740), and none of them are excluded. Judging from the data, they look like digested/encoded strings. They are typed as STRING in the ARFF file, although the description says they are categorical:

% The training set contains 50,000 examples. The first predictive 14,740 variables are numerical and the last 260 predictive variables are categorical. The last target variable is binary (-1,1).
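
One way to confirm how the columns are declared is to count attribute types in the ARFF header. The sketch below is a simplified parser written for illustration: the function name and the sample header are hypothetical, it does not handle escaped quotes inside attribute names, and it is not the parser that PyOpenML uses.

```python
from collections import Counter

# Hypothetical header, only to exercise the function; not the real file.
SAMPLE_HEADER = """@RELATION sample
@ATTRIBUTE Var1 NUMERIC
@ATTRIBUTE Var2 REAL
@ATTRIBUTE 'Var 3' STRING
@ATTRIBUTE Var4 {a,b}
@ATTRIBUTE target {-1,1}
@DATA
"""


def count_arff_attribute_types(header_lines):
    """Count declared attribute types, stopping at the @DATA marker."""
    counts = Counter()
    for line in header_lines:
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@data"):
            break
        if not lowered.startswith("@attribute"):
            continue
        rest = stripped[len("@attribute"):].strip()
        if rest.startswith(("'", '"')):
            # Quoted name: the type follows the closing quote.
            end = rest.index(rest[0], 1)
            type_part = rest[end + 1:].strip()
        else:
            parts = rest.split(None, 1)
            type_part = parts[1].strip() if len(parts) > 1 else ""
        if not type_part:
            continue
        # Nominal attributes list their values in braces.
        kind = "nominal" if type_part.startswith("{") else type_part.split()[0].lower()
        counts[kind] += 1
    return counts
```

Calling `count_arff_attribute_types(SAMPLE_HEADER.splitlines())` on the sample gives one `string` attribute and two `nominal` ones; on the real file, a large `string` count would confirm the typing described above.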

What should we do? Remove the entire dataset (it's one of the few relatively large ones: 50k rows x 15k features)? Exclude the string columns? Keep it, given that several frameworks are able to handle it? Note that we won't have any reference metric (RF, tunedRF) for this dataset.

Ideally, maybe PyOpenML should automatically ignore the features it can't handle in numpy format and emit a warning.
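
A rough sketch of what that behaviour could look like on the caller side, assuming the data has already been loaded as a pandas DataFrame. `to_float_array` is a hypothetical helper, not part of PyOpenML, and it only keeps columns that pandas already treats as numeric; categorical columns would need to be encoded beforehand.

```python
import warnings

import numpy as np
import pandas as pd


def to_float_array(df):
    """Convert numeric columns to a float32 array, warning about the rest.

    Hypothetical helper: non-numeric columns are dropped instead of raising.
    """
    numeric = df.select_dtypes(include="number")
    dropped = [col for col in df.columns if col not in numeric.columns]
    if dropped:
        warnings.warn(
            "Ignoring %d non-numeric column(s) in numpy format: %s"
            % (len(dropped), dropped)
        )
    return numeric.to_numpy(dtype=np.float32), dropped


# Toy frame reusing the string value from the traceback above.
df = pd.DataFrame(
    {"instance_id": ["30n20b8", "a1c1s1"], "x": [1.0, 2.5], "n": [3, 4]}
)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    X, dropped = to_float_array(df)
```

Here `instance_id` is dropped with a warning and the two remaining columns are returned as a float32 array. Whether dropping silently changes benchmark results for datasets like 42759 is a separate question that this sketch does not address.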

Jul 28 '21 14:07 sebhrusen
