
Sparse columns

ogrisel opened this issue · 8 comments

Should a dedicated API or column metadata to efficiently support sparse columns be part of the spec?

Context

It can be the case that more than 99% of a given column's values are null or missing (or some other repeated constant value). In that case, we waste both memory and computation unless we use a dedicated memory representation that does not explicitly materialize these repeated values.

Use cases

  • efficient computation: e.g. computing the mean and standard deviation of a sparse column where more than 99% of the values are zero (see the sketch after this list)
  • efficient computation: e.g. computing the nanmean and nanstd of a sparse column where more than 99% of the values are missing
  • some machine learning estimators have special treatment for sparse columns (e.g. for memory-efficient representation of one-hot encoded categorical data), but they could often (in theory) be changed to handle categorical variables using a different representation if those are explicitly tagged as such.
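A minimal sketch of the first two points, assuming pandas' sparse extension dtype (the data is made up):

```python
import numpy as np
import pandas as pd

# Made-up column of 1M values where fewer than 0.1% are non-zero.
n = 1_000_000
rng = np.random.default_rng(0)
dense = np.zeros(n)
dense[rng.choice(n, size=500, replace=False)] = rng.normal(size=500)

# Store it sparsely: only the 500 explicit values are materialized.
s = pd.Series(pd.arrays.SparseArray(dense, fill_value=0.0))
print(s.memory_usage())   # a few KB instead of ~8 MB for the dense column
print(s.mean(), s.std())  # in principle only the stored values are needed
```

Whether a given library actually computes such reductions without densifying is an implementation detail; the point of spec-level support would be to make that possible.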

Limitations

  • treating sparsity at the single-column level can be limiting: some machine learning algorithms can only leverage sparsity when considering many sparse columns together as a sparse matrix in a Compressed Sparse Row (CSR) representation (e.g. logistic regression with non-coordinate-based gradient solvers such as SGD or L-BFGS, and kernel machines such as support vector machines, Gaussian processes and kernel approximation methods)
  • others can leverage sparsity in a column-wise manner, typically by accepting Compressed Sparse Column (CSC) data (e.g. coordinate descent solvers for the Lasso, random forests, gradient boosted trees); see the sketch after this list
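A minimal sketch of the two layouts mentioned above, using scipy.sparse (made-up data):

```python
import numpy as np
import scipy.sparse as sp

# Made-up data: 3 non-zero entries of a 5x3 matrix, first assembled in COO form.
values = np.array([1.0, 2.0, 3.0])
rows = np.array([0, 2, 4])
cols = np.array([1, 0, 2])
coo = sp.coo_matrix((values, (rows, cols)), shape=(5, 3))

X_csr = coo.tocsr()  # row-oriented: what SGD / L-BFGS-style solvers consume
X_csc = coo.tocsc()  # column-oriented: what coordinate descent and trees consume
```

Both conversions avoid ever materializing the zeros.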

Survey of existing support

(incomplete, feel free to edit or comment)

Questions:

  • Should sparse datastructures be allowed to represent both missingness and zeroness, or only one of those? (I assume both would be useful, as pandas does with the fill_value param; see the sketch after this list)
  • Should this be some kind of optional module / extension of the main dataframe API spec?
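For reference, a minimal sketch of the pandas behaviour alluded to in the first question: fill_value determines what the non-stored entries mean, so the same machinery covers both zeroness and missingness.

```python
import numpy as np
import pandas as pd

# fill_value=0.0: non-stored entries mean "zero" (zeroness).
zeros = pd.arrays.SparseArray([0.0, 0.0, 1.5, 0.0], fill_value=0.0)

# fill_value=nan: non-stored entries mean "missing" (missingness).
missing = pd.arrays.SparseArray([np.nan, np.nan, 1.5, np.nan], fill_value=np.nan)

print(zeros.npoints, missing.npoints)        # 1 1 -- only explicit values are stored
print(zeros.fill_value, missing.fill_value)  # 0.0 nan
```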

ogrisel · Aug 25 '21

Note: there is a dedicated discussion for single-column categorical data representation in #41.

ogrisel · Aug 25 '21

Should sparse datastructures be allowed to represent both missingness and zeroness, or only one of those? (I assume both would be useful, as pandas does with the fill_value param)

That's a really subtle question, which isn't even fully worked out in array/tensor libraries that provide sparse data structures. My first impression was to leave it undefined, because the interpretation does not necessarily depend on the memory layout. However, there is an interaction with the existing missing-data support, so that may not be feasible.

fill_value was looked at quite a bit for PyTorch, but it seems like there's additional complexity and there are very limited use cases for non-zero fill values.

rgommers · Aug 26 '21

Should this be some kind of optional module / extension of the main dataframe API spec?

It seems like there are only a few libraries that support sparse columns. Perhaps a first step would be to use the metadata attribute to store a sparse column and see if two of those libraries can be made to work together. A concrete use case would help a lot.

Memory-layout-wise, sparse is a bit of a problem. pandas seems to use COO; scipy.sparse has many formats, of which CSR/CSC are the most performant. It'd be nontrivial to write a clear memory layout description here that isn't overly complex.

rgommers · Aug 26 '21

A concrete use case would help a lot.

A concrete use case would be to do lossless round-trip conversions of very sparse data between libraries that implement sparse columns, either for zeroness or missingness (or ideally both), without triggering an unexpectedly large memory allocation, a MemoryError, or the OOM killer.

For instance, we could have a dataframe storing the one-hot encoded representation of 6M Wikipedia abstracts, with 100,000 columns for the 100,000 most frequent words in Wikipedia. A dense float64 materialization would need 6M × 100,000 × 8 bytes ≈ 4.8 TB. Assuming Wikipedia abstracts have far fewer than 1000 words on average, the sparse representation should easily fit in memory, but the conversion would probably break (or be very inefficient) if it silently tried to materialize the zeros.
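A back-of-the-envelope check of these numbers (the 100-words-per-abstract average is an assumption for illustration):

```python
# Rough memory estimates for the hypothetical 6M x 100k one-hot matrix.
n_docs, n_cols = 6_000_000, 100_000
dense_bytes = n_docs * n_cols * 8              # float64: ~4.8 TB
avg_nnz_per_doc = 100                          # assumed average words per abstract
nnz = n_docs * avg_nnz_per_doc
csr_bytes = nnz * (8 + 4) + (n_docs + 1) * 8   # values + int32 indices + indptr
print(f"dense: {dense_bytes / 1e12:.1f} TB, CSR: {csr_bytes / 1e9:.1f} GB")
# dense: 4.8 TB, CSR: 7.2 GB
```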

ogrisel · Sep 01 '21

That being said, I am not sure that dataframe libraries are often used for this kind of sparse data manipulation. Furthermore, text processing with one-hot encoding is less and less popular now that most interesting NLP tasks are handled using lower-dimensional dense embeddings from pre-trained neural networks.

ogrisel · Sep 01 '21

Thanks @ogrisel, the application makes a lot of sense.

That being said, I am not sure that dataframe libraries are often used for this kind of sparse data manipulation.

Indeed, by use case I also meant: can this actually be done today with two dataframe libraries? If no two libraries support the same format of sparse data, then adding the capability to the protocol may be a bit premature.

rgommers · Sep 02 '21

pandas and vaex both support sparse data (for zeroness) without materialization, although with different memory layouts: vaex uses a scipy.sparse CSR matrix, while pandas has individual sparse columns.

Arrow has null chunks that do not store any values when a full chunk is null.

ogrisel · Sep 03 '21

So we should probably have a prototype that goes from one of pandas/Vaex/Arrow to another one of those libraries without a densification step in between. That may result in something that can be generalized. Given that scipy.sparse can convert between CSR and COO efficiently, and pandas is based on COO (with df.sparse.to_coo() to export to scipy.sparse format), that should be doable.
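A minimal sketch of the pandas-to-scipy leg of such a prototype (made-up data; the import side into another library is left out):

```python
import pandas as pd

# Made-up sparse dataframe: all columns use fill_value=0.0 (zeroness).
df = pd.DataFrame({
    "a": pd.arrays.SparseArray([0.0, 0.0, 1.0, 0.0], fill_value=0.0),
    "b": pd.arrays.SparseArray([0.0, 2.0, 0.0, 0.0], fill_value=0.0),
})

coo = df.sparse.to_coo()   # pandas' documented export to a scipy COO matrix
csr = coo.tocsr()          # re-layout as CSR without ever densifying
print(csr.nnz, csr.shape)  # 2 (4, 2)
```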

rgommers · Sep 06 '21