automlbenchmark
Add support for sparse data
Currently, sparse datasets are automatically converted to dense data, which can produce extremely large datasets and lead to OOM errors. OpenML provides some datasets in sparse ARFF format: see for example https://www.openml.org/t/317613
The benchmark app needs to be able to load sparse data and pass it to frameworks without converting it to dense data; handling the sparse data is then the framework's responsibility. For frameworks that cannot handle it, we can provide a utility function to convert it to dense data, knowing that this may lead to OOM in some situations.
- For frameworks loading ARFF files directly, the app still needs to write the corresponding sparse ARFF files for each fold.
- For frameworks only handling numpy arrays or pandas dataframes, we need to expose the data as a sparse matrix or a pandas dataframe.
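To illustrate the first point, here is a minimal sketch of writing one fold to a sparse ARFF file without ever materialising dense rows. The helper names `write_sparse_arff` and `split_fold`, the toy attributes and the row representation (a dict from 0-based column index to non-zero value) are assumptions made for this example, not existing benchmark code. It only covers numeric attributes.

```python
import io


def write_sparse_arff(out, relation, attributes, rows):
    """Write rows to the text stream `out` in sparse ARFF format.

    attributes: list of (name, arff_type) pairs, e.g. ("f0", "NUMERIC").
    rows: iterable of dicts mapping 0-based column index -> non-zero value.
    (Hypothetical helper, numeric attributes only.)
    """
    out.write(f"@RELATION {relation}\n\n")
    for name, arff_type in attributes:
        out.write(f"@ATTRIBUTE {name} {arff_type}\n")
    out.write("\n@DATA\n")
    for row in rows:
        # A sparse ARFF instance lists only the non-zero cells as
        # "<index> <value>" pairs, ordered by column index, inside braces.
        cells = ", ".join(f"{idx} {row[idx]}" for idx in sorted(row))
        out.write("{" + cells + "}\n")


def split_fold(rows, test_indices):
    """Split rows into (train, test) lists given the test row indices."""
    test_set = set(test_indices)
    train = [r for i, r in enumerate(rows) if i not in test_set]
    test = [r for i, r in enumerate(rows) if i in test_set]
    return train, test


# Toy dataset: 3 rows, 4 numeric columns, stored sparsely.
attributes = [("f0", "NUMERIC"), ("f1", "NUMERIC"),
              ("f2", "NUMERIC"), ("target", "NUMERIC")]
rows = [{0: 1.5, 3: 1}, {2: 4.0}, {1: 2.0, 3: 1}]

train_rows, test_rows = split_fold(rows, test_indices=[1])

# Written to an in-memory buffer here; a real fold would go to a file.
buffer = io.StringIO()
write_sparse_arff(buffer, "toy_fold0_train", attributes, train_rows)
train_arff = buffer.getvalue()
```

The same function would be called once for the train split and once for the test split of each fold, so memory use stays proportional to the number of non-zero values.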
We may want to implement https://github.com/openml/automlbenchmark/issues/116 first and then use pandas sparse dataframes.
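For the numpy/pandas case, a rough sketch of what exposing the data could look like, assuming the loader already produces a `scipy.sparse` matrix. The functions `to_sparse_frame` and `to_dense_checked` and the `max_bytes` parameter are hypothetical names for this example; only `pd.DataFrame.sparse.from_spmatrix` and `toarray()` are existing pandas/scipy calls.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def to_sparse_frame(matrix, columns):
    """Wrap a scipy sparse matrix in a pandas dataframe with sparse columns."""
    return pd.DataFrame.sparse.from_spmatrix(matrix, columns=columns)


def to_dense_checked(matrix, max_bytes):
    """Convert to a dense numpy array, refusing if it would exceed max_bytes.

    (Hypothetical utility for frameworks that cannot handle sparse input.)
    """
    n_rows, n_cols = matrix.shape
    needed = n_rows * n_cols * matrix.dtype.itemsize
    if needed > max_bytes:
        raise MemoryError(
            f"dense conversion needs {needed} bytes, limit is {max_bytes}"
        )
    return matrix.toarray()


# Toy sparse matrix: 3 rows, 3 columns, 2 non-zero values.
X = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                [0.0, 0.0, 0.0],
                                [2.0, 0.0, 0.0]]))

df = to_sparse_frame(X, columns=["f0", "f1", "f2"])  # stays sparse
dense = to_dense_checked(X, max_bytes=1024)          # small enough to densify
```

Checking the estimated size before calling `toarray()` lets the utility fail with a clear message instead of exhausting memory part-way through the conversion.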
We need to verify sparse data handling now that https://github.com/openml/automlbenchmark/pull/293 is merged.