openml-python
Issue downloading dataset
Description
Calling dataset = openml.datasets.get_dataset(did) raises arff.BadAttributeType ("Bad @ATTRIBUTE type").
Steps/Code to Reproduce
import openml
dataset = openml.datasets.get_dataset(did)
Expected Results
No errors thrown.
Actual Results
File "mfe_3.py", line 33, in <module>
dataset = openml.datasets.get_dataset(data)
File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/functions.py", line 530, in get_dataset
description, features, qualities, arff_file
File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/functions.py", line 1023, in _create_dataset_from_description
qualities=qualities,
File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 183, in __init__
self.data_pickle_file = self._create_pickle_in_cache(data_file)
File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 423, in _create_pickle_in_cache
X, categorical, attribute_names = self._parse_data_from_arff(data_file)
File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 316, in _parse_data_from_arff
data = self._get_arff(self.format)
File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 295, in _get_arff
return decode_arff(fh)
File "/home/mmeskhi/.local/lib/python3.6/site-packages/openml/datasets/dataset.py", line 288, in decode_arff
return_type=return_type)
File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 895, in decode
raise e
File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 892, in decode
matrix_type=return_type)
File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 822, in _decode
attr = self._decode_attribute(row)
File "/home/mmeskhi/.local/lib/python3.6/site-packages/arff.py", line 764, in _decode_attribute
raise BadAttributeType()
arff.BadAttributeType: Bad @ATTRIBUTE type, at line 2
Versions
macOS-10.15.6-x86_64-i386-64bit
Python 3.8.3 (default, Jul 2 2020, 11:26:31)
[Clang 10.0.0 ]
NumPy 1.18.5
SciPy 1.5.0
Scikit-Learn 0.23.1
OpenML 0.10.2
Could you please give some details for which dataset(s) this happens?
Please see the attached list of dataset IDs (did) that I have found so far.
They all seem to come from one source: FOREX trading data.
did 41764
name FOREX_gbpusd-day-High
version 1
uploader 1
status active
format arff
MajorityClassSize 937
MaxNominalAttDistinctValues 2
MinorityClassSize 897
NumberOfClasses 2
NumberOfFeatures 12
NumberOfInstances 1834
NumberOfInstancesWithMissingValues 0
NumberOfMissingValues 0
NumberOfNumericFeatures 11
Hey, the issue here is that this dataset contains fields of type 'date', which are not supported by the ARFF parser used in Python (liac-arff). There's an open PR to support that (https://github.com/renatopp/liac-arff/pull/67), but it's gone stale. We'd be happy if you'd like to pick that up.
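To check whether a given ARFF file is affected, one could scan its header for attributes declared as date. The snippet below is a minimal sketch using only the standard library. The function name find_date_attributes and the sample header are hypothetical: they are not part of openml-python or liac-arff, and the header is made up for illustration rather than copied from the FOREX datasets.

```python
import re

# Matches header lines such as:  @ATTRIBUTE Timestamp DATE "yyyy-MM-dd"
_DATE_ATTR = re.compile(
    r"""^\s*@attribute\s+              # keyword (matched case-insensitively)
        (?P<name>'[^']*'|"[^"]*"|\S+)  # attribute name, optionally quoted
        \s+date\b                      # the attribute type must be 'date'
        (?:\s+(?P<fmt>.+?))?           # optional date format string
        \s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def find_date_attributes(arff_text):
    """Return (line_number, name, date_format) for each DATE attribute.

    Only the header is scanned; scanning stops at the @DATA marker.
    date_format is None when the declaration does not specify one.
    """
    found = []
    for lineno, line in enumerate(arff_text.splitlines(), start=1):
        if line.strip().lower() == "@data":
            break
        match = _DATE_ATTR.match(line)
        if match:
            name = match.group("name").strip("'\"")
            fmt = (match.group("fmt") or "").strip("'\"") or None
            found.append((lineno, name, fmt))
    return found


# Hypothetical header, for illustration only.
SAMPLE_HEADER = "\n".join([
    "@RELATION forex",
    '@ATTRIBUTE Timestamp DATE "yyyy-MM-dd"',
    "@ATTRIBUTE Bid_Open NUMERIC",
    "@ATTRIBUTE Class {True,False}",
    "@DATA",
])

print(find_date_attributes(SAMPLE_HEADER))
# [(2, 'Timestamp', 'yyyy-MM-dd')]
```

In this made-up header the date attribute sits on line 2, the same position the parser complains about in the traceback above ("at line 2").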
@mfeurer I will look into it and try to see what I can do about it. Thanks for the feedback!
Hi all, is there any progress on this issue?
Yes and no.
Yes: Since 0.12.0, the get_dataset call should no longer raise the error, because that call no longer loads the data. This means you get access to the dataset object and its metadata.
No: The ARFF parser still does not support the date type used in the ARFF file. As soon as you actually try to load the data (e.g. with OpenMLDataset.get_data()), the same error is thrown.
To me it makes the most sense to wait until the dataset is available in parquet format, since that should hopefully work without issues (and if not, it's worthwhile to improve the parquet support).
Hi, I'm a master's student under supervision of @joaquinvanschoren.
As a temporary fix, you could convert the timestamps to a Unix timestamp format and hint to ARFF that it is a numeric type.
I've made some quick adjustments to the decode_arff function that do exactly that; check it out at: https://github.com/chclam/openml-python/commit/136c27940b3cb9974e8272bae67007b4e1be5dc8
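The idea of that workaround can be sketched independently of openml-python. The code below is not the code from the linked commit. It is a minimal standard-library illustration that rewrites ARFF text so that one date column becomes NUMERIC Unix timestamps. The helper names date_to_unix and dates_to_numeric are hypothetical. The sketch also assumes that dates are in UTC, that the caller supplies a strptime format matching the file, and that rows are dense with no quoted commas.

```python
from datetime import datetime, timezone


def date_to_unix(value, fmt="%Y-%m-%d"):
    """Parse a date string and return seconds since the Unix epoch.

    Assumption: the date is interpreted as UTC.
    """
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def dates_to_numeric(arff_text, column, fmt="%Y-%m-%d"):
    """Return ARFF text in which attribute number `column` (0-based) is
    declared NUMERIC and its values are replaced by Unix timestamps.

    Naive on purpose: assumes dense rows, no commas inside quoted values,
    and an attribute name without spaces.
    """
    out = []
    attr_index = -1
    in_data = False
    for line in arff_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if in_data:
            # Skip blank lines and '%' comments in the data section.
            if stripped and not stripped.startswith("%"):
                cells = stripped.split(",")
                raw = cells[column].strip().strip("'\"")
                if raw != "?":  # '?' marks a missing value in ARFF
                    cells[column] = repr(date_to_unix(raw, fmt))
                line = ",".join(cells)
        elif lowered.startswith("@attribute"):
            attr_index += 1
            if attr_index == column:
                name = stripped.split()[1]
                line = "@ATTRIBUTE {} NUMERIC".format(name)
        elif lowered == "@data":
            in_data = True
        out.append(line)
    return "\n".join(out)


# Hypothetical input, for illustration only.
SAMPLE_ARFF = "\n".join([
    "@RELATION forex",
    '@ATTRIBUTE Timestamp DATE "yyyy-MM-dd"',
    "@ATTRIBUTE Bid_Open NUMERIC",
    "@DATA",
    '"2020-01-01",1.5',
    "1970-01-02,2.0",
])

print(dates_to_numeric(SAMPLE_ARFF, column=0))
```

The rewritten text declares only NUMERIC attributes, so a parser without date support should be able to read it. The trade-off is that the original date format is lost unless it is stored separately.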