
Add support for sparse data

Open sebhrusen opened this issue 4 years ago • 1 comments

Currently, sparse datasets are automatically converted into dense data, generating extremely large datasets that can lead to OOM. OpenML provides some datasets in sparse ARFF format: see for example https://www.openml.org/t/317613
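To give a rough sense of why densifying can run out of memory, here is a back-of-the-envelope sketch. The shape and the number of non-zeros per row are made-up illustrative values, not taken from the OpenML task above, and the CSR layout with 8-byte values and 4-byte indices is an assumption about how the sparse data would be held in memory.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Memory needed to store every cell, zeros included."""
    return n_rows * n_cols * itemsize


def csr_bytes(n_rows, nnz, itemsize=8, index_itemsize=4):
    """Approximate memory of a CSR matrix: one value and one column index
    per non-zero entry, plus n_rows + 1 row pointers."""
    return nnz * (itemsize + index_itemsize) + (n_rows + 1) * index_itemsize


# Hypothetical dataset: 100k rows, 50k columns, 50 non-zeros per row.
n_rows, n_cols, nnz_per_row = 100_000, 50_000, 50
nnz = n_rows * nnz_per_row

dense = dense_bytes(n_rows, n_cols)
sparse = csr_bytes(n_rows, nnz)
print(f"dense:  {dense / 1e9:.1f} GB")
print(f"sparse: {sparse / 1e6:.1f} MB")
```

Under these assumptions the dense copy is several hundred times larger than the sparse one, which is the kind of blow-up that leads to OOM.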

The benchmark app needs to be able to load sparse data and pass it to frameworks without converting it to dense data; it is then the frameworks' responsibility to handle the sparse data. If they can't, we can provide a utility function to convert it into dense data, knowing that this may lead to OOM in some situations.

  • For frameworks loading ARFF files directly, the app still needs to write the corresponding sparse ARFF files for each fold.
  • For frameworks only handling numpy arrays or pandas dataframes, we need to expose those data as a sparse matrix / pandas dataframe.
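For the first point, a minimal sketch of writing one fold as a sparse ARFF file could look like the following. It is not the benchmark's actual writer: the function names, the dict-per-row representation and the `folds` variable are all hypothetical, and it assumes numeric attributes with simple names that need no quoting. It relies only on the sparse ARFF convention that each data row is written as `{index value, ...}` with 0-based attribute indices and omitted entries meaning zero.

```python
def format_sparse_row(row):
    """Format one row given as {column_index: value}, skipping explicit zeros."""
    items = sorted((i, v) for i, v in row.items() if v != 0)
    return "{" + ", ".join(f"{i} {v}" for i, v in items) + "}"


def dump_sparse_arff(relation, attributes, rows):
    """Return the text of a sparse ARFF file (numeric attributes only)."""
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in attributes]
    lines += ["", "@DATA"]
    lines += [format_sparse_row(row) for row in rows]
    return "\n".join(lines) + "\n"


# Hypothetical data: 4 rows, 3 numeric attributes, stored as one dict per row.
rows = [{0: 1.5}, {2: 3.0}, {}, {0: 2.0, 1: 4.0}]
folds = {"train": [0, 1, 3], "test": [2]}  # made-up row indices per split

# Select the rows of one split without ever building a dense table.
train_text = dump_sparse_arff("demo_train", ["f0", "f1", "f2"],
                              [rows[i] for i in folds["train"]])
print(train_text)
```

Selecting fold rows by index keeps the data sparse end to end; only the non-zero entries are ever written.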

We may want to implement https://github.com/openml/automlbenchmark/issues/116 first and then use pandas sparse dataframes.
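As a sketch of what the pandas route might look like, the snippet below wraps a SciPy sparse matrix in a sparse dataframe and adds a guarded dense conversion. It assumes NumPy, SciPy and pandas are available; `to_dense_checked` and its `max_bytes` parameter are hypothetical names for the utility mentioned above, not existing benchmark code, and the 8-bytes-per-cell estimate assumes float64 data.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Tiny stand-in for a loaded sparse dataset.
X = sparse.csr_matrix(np.array([[0.0, 0.0, 1.0],
                                [0.0, 2.0, 0.0],
                                [0.0, 0.0, 0.0]]))

# Expose it as a pandas dataframe backed by sparse columns.
df = pd.DataFrame.sparse.from_spmatrix(X, columns=["a", "b", "c"])

# Frameworks that want a SciPy matrix can get one back without densifying.
X_back = df.sparse.to_coo()


def to_dense_checked(sparse_df, max_bytes):
    """Convert to a dense dataframe, refusing if the estimate exceeds max_bytes."""
    n_rows, n_cols = sparse_df.shape
    needed = n_rows * n_cols * 8  # assumes float64 cells
    if needed > max_bytes:
        raise MemoryError(f"dense conversion needs ~{needed} bytes, limit is {max_bytes}")
    return sparse_df.sparse.to_dense()


dense_df = to_dense_checked(df, max_bytes=1_000)
```

Failing early with an explicit limit is one possible design; the alternative of letting the conversion proceed and crash gives frameworks no chance to fall back.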

sebhrusen, Nov 19 '20 14:11

We need to verify sparse data handling now that https://github.com/openml/automlbenchmark/pull/293 is merged.

sebhrusen, May 20 '21 18:05