openml-python
Don't download (large) datasets by default
Description
In datasets.get_dataset(data_id) the default is currently to always download the dataset: https://openml.github.io/openml-python/master/generated/openml.datasets.get_dataset.html#openml.datasets.get_dataset
This is problematic for large datasets - it takes a long time and may cause out-of-memory errors. Sometimes we need to look at the full meta-data (of many datasets) without downloading the data. We can do that now with the option download_data=False, but it feels like this should be the default. Some users may also be unaware of this option or the fact that get_dataset will actually download the data and consume resources.
A simple solution would be to make download_data=False the default.
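Until the default changes, the workaround mentioned above can be wrapped in a small helper. This is only a sketch: `load_metadata_only` and its `get_dataset` argument are hypothetical names introduced here for illustration (the argument exists so the helper can be exercised without network access); the only OpenML call it relies on is `openml.datasets.get_dataset` with the existing `download_data=False` option.

```python
def load_metadata_only(data_id, get_dataset=None):
    """Return a dataset object without downloading the data file.

    `get_dataset` is a hypothetical injection point for this sketch;
    by default the real openml.datasets.get_dataset is used.
    """
    if get_dataset is None:
        # Imported lazily so that defining the helper needs neither
        # the openml package nor a network connection.
        import openml
        get_dataset = openml.datasets.get_dataset
    # download_data=False asks OpenML for the description only.
    return get_dataset(data_id, download_data=False)


# Example use (requires network access):
#   dataset = load_metadata_only(41081)
#   print(dataset.name)
```

Exactly which metadata is fetched alongside the description may differ between OpenML versions, so treat this as an illustration of the option rather than a statement about what the call returns.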
Steps/Code to Reproduce
import openml
openml.datasets.get_dataset(41081)
Expected Results
The dataset metadata within seconds
Actual Results
A long wait until the dataset has been downloaded and parsed.
Versions
macOS-10.16-x86_64-i386-64bit
Python 3.8.5 (default, Sep 4 2020, 02:22:02) [Clang 10.0.0 ]
NumPy 1.19.5
SciPy 1.5.2
Scikit-Learn 0.23.2
OpenML 0.11.1dev
I'd also prefer this. I'd go as far as to say that I'd prefer lazy loading for all data that requires disk or network operations.
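To make the lazy-loading idea concrete, here is a minimal sketch of the pattern in plain Python. It is not how openml-python is implemented; `LazyDataset` and its `loader` argument are hypothetical names used only to show that the expensive step can be deferred until the data is first accessed and then cached.

```python
from functools import cached_property


class LazyDataset:
    """Hypothetical container: metadata is cheap, data is loaded on demand."""

    def __init__(self, metadata, loader):
        self.metadata = metadata  # available immediately
        self._loader = loader     # callable doing the disk/network work

    @cached_property
    def data(self):
        # Runs the loader on first access only; the result is cached
        # on the instance, so later accesses do no further I/O.
        return self._loader()


# Example use with a stand-in loader:
#   ds = LazyDataset({"name": "demo"}, loader=lambda: [1, 2, 3])
#   ds.metadata   # no loading happens here
#   ds.data       # loader runs now, once
```

With this shape, code that only inspects metadata across many datasets never pays the download cost, while code that needs the data gets it transparently.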
From 0.15.0 onwards, not downloading the data is the default; this was changed in PR #1260.