openml-python Don't download (large) datasets by default

Open joaquinvanschoren opened this issue 3 years ago • 2 comments

Description

In datasets.get_dataset(data_id) the default is currently to always download the dataset: https://openml.github.io/openml-python/master/generated/openml.datasets.get_dataset.html#openml.datasets.get_dataset

This is problematic for large datasets: it takes a long time and may cause out-of-memory errors. Sometimes we need to look at the full metadata of many datasets without downloading the data. We can do that now with the option download_data=False, but it feels like this should be the default. Some users may also be unaware of this option, or of the fact that get_dataset will actually download the data and consume resources.

A simple solution would be to make download_data=False the default.
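For reference, here is a minimal sketch of the existing workaround, passing download_data=False explicitly. The helper name fetch_metadata is hypothetical and not part of openml-python; the sketch assumes the returned dataset object exposes name and description attributes. The import sits inside the function so that the package is only needed when the helper is actually called.

```python
def fetch_metadata(data_id):
    """Return basic metadata for an OpenML dataset without downloading its data.

    Hypothetical helper for illustration; not part of openml-python.
    """
    # Imported here so the third-party package is only required on use.
    import openml

    # download_data=False asks get_dataset to skip fetching the data file.
    dataset = openml.datasets.get_dataset(data_id, download_data=False)

    # Assumption: the dataset object exposes these metadata attributes.
    return {"name": dataset.name, "description": dataset.description}
```

Calling fetch_metadata(41081) requires network access to the OpenML server, so it is not run here.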

Steps/Code to Reproduce

import openml
openml.datasets.get_dataset(41081)

Expected Results

The dataset metadata within seconds

Actual Results

A long wait until the dataset has been downloaded and parsed.

Versions

macOS-10.16-x86_64-i386-64bit
Python 3.8.5 (default, Sep 4 2020, 02:22:02) [Clang 10.0.0 ]
NumPy 1.19.5
SciPy 1.5.2
Scikit-Learn 0.23.2
OpenML 0.11.1dev

Mar 09 '21 10:03 joaquinvanschoren

I'd also prefer this. I'd go as far as to say that I'd prefer lazy loading for all data that requires disk or network operations.

Mar 10 '21 14:03 PGijsbers
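The lazy loading suggested in the comment above could look roughly like the following generic sketch. It is not openml-python code: LazyDataset, its loader argument, and load_count are hypothetical names used only to illustrate the pattern, where metadata is available immediately and the expensive load runs on first access.

```python
from functools import cached_property


class LazyDataset:
    """Illustrative container: cheap metadata up front, data loaded on demand."""

    def __init__(self, name, loader):
        self.name = name          # metadata, available immediately
        self._loader = loader     # callable doing the disk or network work
        self.load_count = 0       # counts how often the loader actually ran

    @cached_property
    def data(self):
        # Runs only on first access; the result is cached on the instance.
        self.load_count += 1
        return self._loader()


ds = LazyDataset("demo", lambda: [1, 2, 3])
print(ds.name)   # no loading has happened yet
print(ds.data)   # first access triggers the loader
print(ds.data)   # second access reuses the cached result
```

With this pattern, code that only inspects metadata never pays the cost of loading the data.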

Not downloading the data is the default from 0.15.0 onwards; we changed this in PR #1260.

Jun 16 '23 08:06 LennartPurucker
