issue with OpenML tasks openml.org/t/359947 & openml.org/t/360115

Open sebhrusen opened this issue 3 years ago • 0 comments

Error with TPOT, Autosklearn, RandomForest (and other sklearn-based frameworks):

[ERROR] [amlb.benchmark:01:28:09.198] PyOpenML cannot handle string when returning numpy arrays. Use dataset_format="dataframe".
Traceback (most recent call last):
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/openml/datasets/dataset.py", line 623, in _convert_array_format
    return np.asarray(data, dtype=np.float32)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/pandas/core/generic.py", line 1899, in __array__
    return np.asarray(self._values, dtype=dtype)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/numpy/core/_asarray.py", line 102, in asarray
    return array(a, dtype, copy=False, order=order)
ValueError: could not convert string to float: '30n20b8'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/seb/repos/ml/automlbenchmark/amlb/benchmark.py", line 512, in run
    meta_result = self.benchmark.framework_module.run(self._dataset, task_config)
  File "/Users/seb/repos/ml/automlbenchmark/frameworks/TPOT/__init__.py", line 14, in run
    X_train, X_test = impute_array(dataset.train.X_enc, dataset.test.X_enc)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 75, in decorator
    return cache(self, prop_name, prop_fn)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 33, in cache
    value = fn(self)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/process.py", line 702, in profiler
    return fn(*args, **kwargs)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/data.py", line 146, in X_enc
    return self.data_enc[:, predictors_ind]
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 75, in decorator
    return cache(self, prop_name, prop_fn)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/cache.py", line 33, in cache
    value = fn(self)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/process.py", line 702, in profiler
    return fn(*args, **kwargs)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 168, in data_enc
    return self._get_data('array')
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 172, in _get_data
    self.dataset._load_data(fmt)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 129, in _load_data
    train, test = splitter.split()
  File "/Users/seb/repos/ml/automlbenchmark/amlb/utils/process.py", line 702, in profiler
    return fn(*args, **kwargs)
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 204, in split
    X = self.ds._load_full_data('array')
  File "/Users/seb/repos/ml/automlbenchmark/amlb/datasets/openml.py", line 134, in _load_full_data
    X, *_ = self._oml_dataset.get_data(dataset_format=fmt)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/openml/datasets/dataset.py", line 726, in get_data
    data = self._convert_array_format(data, dataset_format, attribute_names)
  File "/Users/seb/.pyenv/versions/amlb/lib/python3.7/site-packages/openml/datasets/dataset.py", line 626, in _convert_array_format
    "PyOpenML cannot handle string when returning numpy"
openml.exceptions.PyOpenMLError: PyOpenML cannot handle string when returning numpy arrays. Use dataset_format="dataframe".
  • Dataset https://www.openml.org/d/41702 doesn't exclude columns as expected:
<!-- description.xml -->
<oml:ignore_attribute>instance_idrepetitionrunstatus</oml:ignore_attribute> 
<!-- features.xml -->
<oml:feature>
  <oml:index>0</oml:index>
  <oml:name>instance_id</oml:name>
  <oml:data_type>string</oml:data_type>
  <oml:is_target>false</oml:is_target>
  <oml:is_ignore>false</oml:is_ignore>   <!-- true expected -->
  <oml:is_row_identifier>false</oml:is_row_identifier> <!-- true expected -->
  <oml:number_of_missing_values>0</oml:number_of_missing_values>
</oml:feature>
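
A minimal sketch of how the mismatch above could be detected from a features.xml payload: it lists string-typed features that are flagged neither as ignored nor as row identifiers. The helper name `unflagged_string_features` is hypothetical, and the namespace URI and the `oml:data_features` root element are assumptions, since the snippet above omits them.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI used by OpenML XML documents (not shown in the issue).
OML_URI = "http://openml.org/openml"
OML_NS = {"oml": OML_URI}


def unflagged_string_features(features_xml):
    """Return names of string features that are not target, ignored, or row id."""
    root = ET.fromstring(features_xml)
    names = []
    for feat in root.iter("{%s}feature" % OML_URI):
        def text(tag):
            # Missing child elements are treated as empty strings.
            return feat.findtext("oml:" + tag, default="", namespaces=OML_NS).strip()

        flagged = any(
            text(tag) == "true"
            for tag in ("is_target", "is_ignore", "is_row_identifier")
        )
        if text("data_type") == "string" and not flagged:
            names.append(text("name"))
    return names


if __name__ == "__main__":
    sample = """<oml:data_features xmlns:oml="http://openml.org/openml">
      <oml:feature>
        <oml:name>instance_id</oml:name>
        <oml:data_type>string</oml:data_type>
        <oml:is_target>false</oml:is_target>
        <oml:is_ignore>false</oml:is_ignore>
        <oml:is_row_identifier>false</oml:is_row_identifier>
      </oml:feature>
    </oml:data_features>"""
    print(unflagged_string_features(sample))
```

Any name this returns is a column that will reach the numpy conversion as raw strings.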
  • Dataset https://www.openml.org/d/42759 has many string columns at the end of its 15k features (starting at column index 14740), with no exclusions. Looking at the data, these appear to be digested/encoded strings. They are actually typed as STRING in the ARFF file, although the description says they are categorical:
% The training set contains 50,000 examples. The first predictive 14,740 variables are numerical and the last 260 predictive variables are categorical. The last target variable is binary (-1,1).

What should we do? Remove the entire dataset (it's one of the few relatively large ones: 50k rows x 15k features)? Exclude the strings? Keep it, given that several frameworks are able to handle it? Note that we won't have any reference metric (RF, tunedRF) for this dataset.

  • Ideally, maybe PyOpenML should automatically ignore the features it can't handle in numpy format and emit a warning.
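
A rough sketch of the behaviour suggested above, applied on the caller side rather than inside PyOpenML: start from a dataframe, encode categoricals as integer codes, skip the columns that cannot be converted to floats, and emit a warning. The function name `to_float_array` is hypothetical, and treating every non-numeric, non-categorical column as droppable is an assumption, not something PyOpenML does today.

```python
import warnings

import numpy as np
import pandas as pd


def to_float_array(df: pd.DataFrame) -> np.ndarray:
    """Convert df to a float32 array, skipping columns that cannot be converted."""
    kept = {}
    dropped = []
    for name in df.columns:
        col = df[name]
        if isinstance(col.dtype, pd.CategoricalDtype):
            # Encode categories as integer codes; missing values (code -1) become NaN.
            codes = col.cat.codes.astype("float32")
            kept[name] = codes.where(codes >= 0, np.nan)
        elif pd.api.types.is_numeric_dtype(col.dtype):
            kept[name] = col.astype("float32")
        else:
            # String/object columns such as instance_id end up here.
            dropped.append(name)
    if dropped:
        warnings.warn(f"Ignoring {len(dropped)} non-numeric column(s): {dropped}")
    return pd.DataFrame(kept, index=df.index).to_numpy(dtype="float32")


if __name__ == "__main__":
    frame = pd.DataFrame({
        "instance_id": ["30n20b8", "a1", "b2"],
        "x": [1.0, 2.5, 3.0],
        "c": pd.Categorical(["u", "v", "u"]),
    })
    print(to_float_array(frame).shape)
```

The input frame could come from `get_data(dataset_format="dataframe")`, as the error message recommends. For dataset 42759 this would silently shrink the feature set by the trailing string columns, so the warning matters for comparing results across frameworks.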

sebhrusen, Jul 28 '21 14:07