openml-python

Issue downloading dataset

Open MichaelMMeskhi opened this issue 4 years ago • 8 comments

Description

Calling dataset = openml.datasets.get_dataset(did) raises arff.BadAttributeType ("Bad @ATTRIBUTE type").

Steps/Code to Reproduce

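import openml

# 41764 (FOREX_gbpusd-day-High) is one of the affected dataset IDs listed later in this thread
did = 41764
dataset = openml.datasets.get_dataset(did)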

Expected Results

No errors thrown.

Actual Results

 File "mfe_3.py", line 33, in <module>
    dataset = openml.datasets.get_dataset(data)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/functions.py", line 530, in get_dataset
    description, features, qualities, arff_file
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/functions.py", line 1023, in _create_dataset_from_description
    qualities=qualities,
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 183, in __init__
    self.data_pickle_file = self._create_pickle_in_cache(data_file)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 423, in _create_pickle_in_cache
    X, categorical, attribute_names = self._parse_data_from_arff(data_file)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 316, in _parse_data_from_arff
    data = self._get_arff(self.format)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 295, in _get_arff
    return decode_arff(fh)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 288, in decode_arff
    return_type=return_type)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 895, in decode
    raise e
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 892, in decode
    matrix_type=return_type)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 822, in _decode
    attr = self._decode_attribute(row)
  File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 764, in _decode_attribute
    raise BadAttributeType()
arff.BadAttributeType: Bad @ATTRIBUTE type, at line 2

Versions

macOS-10.15.6-x86_64-i386-64bit
Python 3.8.3 (default, Jul  2 2020, 11:26:31) 
[Clang 10.0.0 ]
NumPy 1.18.5
SciPy 1.5.0
Scikit-Learn 0.23.1
OpenML 0.10.2

MichaelMMeskhi, Aug 24 '20 20:08

Could you please give some details for which dataset(s) this happens?

mfeurer, Aug 25 '20 06:08

> Could you please give some details for which dataset(s) this happens?

Please see the attached list of dataset IDs (did) that I have found so far.

mfe_medium_errors.txt

MichaelMMeskhi, Aug 25 '20 21:08

They all seem to come from one source: FOREX trading data.

did                                                   41764
name                                  FOREX_gbpusd-day-High
version                                                   1
uploader                                                  1
status                                               active
format                                                 arff
MajorityClassSize                                       937
MaxNominalAttDistinctValues                               2
MinorityClassSize                                       897
NumberOfClasses                                           2
NumberOfFeatures                                         12
NumberOfInstances                                      1834
NumberOfInstancesWithMissingValues                        0
NumberOfMissingValues                                     0
NumberOfNumericFeatures                                  11

MichaelMMeskhi, Aug 25 '20 21:08

Hey, the issue here is that this dataset contains fields of type 'date', which the Python ARFF parser (liac-arff) does not support. There's an open PR to add that support (https://github.com/renatopp/liac-arff/pull/67), but it has gone stale. We'd be happy if you'd like to pick it up.
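
To check whether a given ARFF file is affected before handing it to the parser, the header can be scanned for DATE attributes. This is only a rough sketch using the standard library: find_date_attributes is a hypothetical helper (not part of openml-python or liac-arff), and the sample header is made up for illustration rather than copied from the FOREX datasets.

```python
import re

# Matches "@ATTRIBUTE <name> DATE [<format>]", case-insensitively.
# The attribute name may be bare, single-quoted, or double-quoted.
_DATE_ATTR = re.compile(
    r"""^\s*@attribute\s+("[^"]+"|'[^']+'|\S+)\s+date\b""",
    re.IGNORECASE,
)


def find_date_attributes(arff_text):
    """Return the names of all DATE-typed attributes in an ARFF header."""
    names = []
    for line in arff_text.splitlines():
        if line.strip().lower().startswith("@data"):
            break  # the header is over; everything after this is data rows
        match = _DATE_ATTR.match(line)
        if match:
            names.append(match.group(1).strip("\"'"))
    return names


# Made-up header for illustration only.
sample = """@RELATION forex
@ATTRIBUTE Timestamp DATE "yyyy-MM-dd"
@ATTRIBUTE Bid_Open NUMERIC
@DATA
"2018-01-01",1.35
"""
print(find_date_attributes(sample))
```

A non-empty result means the file declares at least one date attribute, so the parser would be expected to reject it with the error above.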

mfeurer, Aug 26 '20 06:08

@mfeurer I will look into it and try to see what I can do about it. Thanks for the feedback!

MichaelMMeskhi, Aug 31 '20 18:08

Hi all, is there any progress on this issue?

joaquinvanschoren, Nov 02 '21 10:11

Yes/No.

Yes: Since 0.12.0 the get_dataset call should no longer raise the error because the data is not actually loaded anymore with that call. This means you get access to the dataset object and metadata.

No: The ARFF parser still does not support this data type. As soon as you actually try to load the data (e.g. with OpenMLDataset.get_data()), the same error is raised.

To me it makes the most sense to wait until the dataset is available in parquet format, since that should hopefully work without issues (and if not, it's worthwhile to improve the parquet support).

PGijsbers, Nov 02 '21 10:11

Hi, I'm a master's student under supervision of @joaquinvanschoren.

As a temporary fix, you could convert the date values to Unix timestamps and tell the ARFF parser that the attribute is of numeric type.

I've made some quick adjustments to the decode_arff function that do exactly that. Check them out at: https://github.com/chclam/openml-python/commit/136c27940b3cb9974e8272bae67007b4e1be5dc8
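
For readers who just want to see the idea, here is a standalone sketch of that kind of conversion applied to raw ARFF text. It is not the code from the linked commit. dates_to_unix and its strptime_format parameter are hypothetical names, and the sketch assumes dense (non-sparse) rows, no commas inside quoted values, UTC timestamps, and a single Python strptime format for all date columns (the Java-style pattern in the ARFF header is not translated automatically).

```python
import re
from datetime import datetime, timezone

# "@ATTRIBUTE <name> <type...>"; the name may be bare or quoted.
_ATTR = re.compile(
    r"""^\s*@attribute\s+("[^"]+"|'[^']+'|\S+)\s+(.+?)\s*$""", re.IGNORECASE
)


def dates_to_unix(arff_text, strptime_format="%Y-%m-%d"):
    """Rewrite DATE attributes as NUMERIC and their values as Unix timestamps.

    Assumes dense rows, no commas inside quoted values, and UTC dates.
    """
    out = []
    date_columns = []  # indices of attributes declared as DATE
    column = 0         # index of the next attribute seen in the header
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        if not in_data:
            match = _ATTR.match(line)
            if match:
                name, type_ = match.groups()
                if re.match(r"date\b", type_, re.IGNORECASE):
                    date_columns.append(column)
                    line = f"@ATTRIBUTE {name} NUMERIC"
                column += 1
            elif stripped.lower().startswith("@data"):
                in_data = True
        elif stripped and not stripped.startswith("%"):
            cells = stripped.split(",")
            for i in date_columns:
                value = cells[i].strip().strip("\"'")
                if value != "?":  # "?" marks a missing value in ARFF
                    parsed = datetime.strptime(value, strptime_format)
                    parsed = parsed.replace(tzinfo=timezone.utc)
                    cells[i] = str(int(parsed.timestamp()))
            line = ",".join(cells)
        out.append(line)
    return "\n".join(out) + "\n"


# Made-up data for illustration only.
sample = """@RELATION forex
@ATTRIBUTE Timestamp DATE "yyyy-MM-dd"
@ATTRIBUTE Bid_Open NUMERIC
@DATA
"2018-01-01",1.35
"1970-01-02",1.36
"""
print(dates_to_unix(sample))
```

The converted text declares Timestamp as NUMERIC, so a parser without date support should be able to read it. The cost is that the column loses its date semantics and has to be converted back by the caller if needed.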

chclam, Mar 21 '22 14:03