
A dense axis

Open zerothi opened this issue 5 years ago • 11 comments

Would it make sense to allow certain axes as dense parts in the otherwise sparse array?

I.e. this could/would allow faster algorithms along the given axes.

In particular, when dealing with sparse data with many axes but a few recurring dense axes, such an abstraction would be really nice.

Would this be an option or is it too intrusive?

zerothi avatar Jan 10 '19 22:01 zerothi

Hi! This is roughly the purpose of the BSR format in SciPy.

This is definitely in scope, although I don’t personally plan to do it. You’re free to give it a crack if you want, I’ll definitely help review.
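
For illustration, here is a minimal SciPy BSR example (the block-diagonal test matrix is arbitrary) showing how the dense blocks end up stored directly in the `.data` array:

```python
# Illustration of SciPy's BSR format mentioned above: dense blocks inside a
# CSR-like outer structure. The block-diagonal matrix is arbitrary test data.
import numpy as np
from scipy.sparse import bsr_matrix

dense = np.kron(np.eye(3), np.arange(1, 5).reshape(2, 2))  # 6x6, three 2x2 blocks
A = bsr_matrix(dense, blocksize=(2, 2))

print(A.data.shape)   # (3, 2, 2): three stored 2x2 dense blocks
print(A.indices)      # [0 1 2]: block-column index of each stored block
print(A.indptr)       # [0 1 2 3]: CSR-style pointers over block rows
```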

hameerabbasi avatar Jan 10 '19 23:01 hameerabbasi

I suppose you could make a format similar to CSR called CDR (compressed-dense-row) where you'd only have indptr and data arrays, and each nonzero row would be fully dense.

I'm not aware of any usage of this scheme in the wild, and it would really only be efficient for cases where the sparsity pattern is almost always on a per-row basis.
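
A rough sketch of what that CDR idea could look like (a hypothetical format; class and attribute names are illustrative only, not anything SciPy provides):

```python
# Hypothetical "compressed-dense-row" layout: indptr marks which rows are
# stored, and data holds one fully dense row per stored row.
import numpy as np

class CDR:
    def __init__(self, dense):
        dense = np.asarray(dense)
        nonzero_rows = dense.any(axis=1)
        # indptr[i+1] - indptr[i] is 1 if row i is stored, 0 otherwise
        self.indptr = np.concatenate(([0], np.cumsum(nonzero_rows)))
        self.data = dense[nonzero_rows]      # dense payload, one row per stored row
        self.shape = dense.shape

    def todense(self):
        out = np.zeros(self.shape, dtype=self.data.dtype)
        out[np.diff(self.indptr).astype(bool)] = self.data
        return out
```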

perimosocordiae avatar Jan 11 '19 14:01 perimosocordiae

Ok, thanks for your insights.

zerothi avatar Jan 11 '19 14:01 zerothi

The most generic way to do this, in my view, would be to have an arbitrary partition of "dense" and "sparse" axes. So the sparse array would effectively be a transpose and reshaping of a CSC sparse matrix.
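
A simplified illustration of that partition idea (shapes and axis choices are arbitrary): flatten the sparse axes into one dimension and the dense axes into another, and store the result as an ordinary 2-D SciPy sparse matrix.

```python
# Sketch of an arbitrary sparse/dense axis partition: sparse axes flattened
# to rows, dense axes to columns, stored as a 2-D sparse matrix.
import numpy as np
from scipy.sparse import csc_matrix

arr = np.zeros((4, 5, 6))
arr[0, 2] = np.random.rand(6)        # a few fully dense fibers along the last axis
arr[3, 1] = np.random.rand(6)

# sparse axes (0, 1) -> rows, dense axis (2,) -> columns
stored = csc_matrix(arr.reshape(4 * 5, 6))

restored = stored.toarray().reshape(4, 5, 6)
assert np.array_equal(restored, arr)
```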

mcg1969 avatar Mar 08 '19 15:03 mcg1969

Yes, one could definitely store the "inner dense dimensions" as an actual numpy array instead. And always reshuffle things appropriately when indexing or performing any other operations.
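
A minimal sketch of that layout, assuming COO-style coordinates for the sparse axes only and a data array that carries the dense axes as trailing dimensions (names and shapes are illustrative):

```python
# Sketch of "inner dense dimensions": coords index only the sparse axes,
# data stores one dense fiber per stored entry. Illustrative only.
import numpy as np

sparse_shape = (20, 20)              # sparse axes
dense_shape = (6,)                   # trailing dense axis

coords = np.array([[3, 17],          # first-axis indices of stored entries
                   [5,  2]])         # second-axis indices of stored entries
data = np.random.rand(coords.shape[1], *dense_shape)   # shape (nnz, 6)

def todense(coords, data, sparse_shape, dense_shape):
    out = np.zeros(sparse_shape + dense_shape)
    out[tuple(coords)] = data        # assign a dense fiber at each stored coordinate
    return out

assert todense(coords, data, sparse_shape, dense_shape).shape == (20, 20, 6)
```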

hameerabbasi avatar Mar 08 '19 15:03 hameerabbasi

I thought about this some more and it seems the most generic thing to do is to have blocks rather than dense dimensions.

hameerabbasi avatar Mar 10 '19 11:03 hameerabbasi

It took me a bit to figure out that "blocks are more generic than dense dimensions" meant that one could achieve the effect of dense dimensions by making the block size in that dimension equal to the array's size in that dimension.
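
To make that concrete, here is a toy block-COO layout (names and shapes are purely illustrative): coordinates index blocks rather than elements, and a dense axis is recovered by letting the block span that whole axis.

```python
# Toy block-COO sketch: block_coords index the block grid, block_data stores
# one dense block per entry. A block spanning all of axis 2 makes axis 2
# effectively dense. Purely illustrative.
import numpy as np

shape = (6, 6, 4)
block = (2, 2, 4)                    # block spans all of axis 2 -> dense axis 2

block_coords = np.array([[0, 2],     # block index along axis 0, per stored block
                         [1, 0],     # block index along axis 1
                         [0, 0]])    # block index along axis 2 (always 0 here)
block_data = np.random.rand(block_coords.shape[1], *block)   # (2, 2, 2, 4)

def todense(block_coords, block_data, shape, block):
    out = np.zeros(shape)
    for k in range(block_data.shape[0]):
        slices = tuple(slice(c * b, (c + 1) * b)
                       for c, b in zip(block_coords[:, k], block))
        out[slices] = block_data[k]
    return out

assert todense(block_coords, block_data, shape, block).shape == shape
```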

There was a concern in another thread recently about the availability of algorithms for fancy new formats. Similar schemes have been around for several decades, but I don't know that anyone's been working on block COO matrices rather than block CSR matrices.

Out of curiosity, is enough of sparse independent of scipy that a sparse.COO subclass with a properly-dimensioned data array would require only shape-checking modifications to .__init__() and a change to treating the individual elements as arrays in .__matmul__() and .transpose(), or would this require more work?

DWesl avatar Mar 22 '19 23:03 DWesl

I could really use such a feature. In case you are interested in a concrete use case for this type of array: I am working with (time-inhomogeneous) Markov chains. The two dimensions needed for the state space should be sparse, but the time dimension is always dense.

luk-f-a avatar Mar 24 '19 14:03 luk-f-a

Out of curiosity, is enough of sparse independent of scipy that a sparse.COO subclass with a properly-dimensioned data array would require only shape-checking modifications to .__init__() and a change to treating the individual elements as arrays in .__matmul__() and .transpose(), or would this require more work?

We only use SciPy to convert to and from SciPy arrays, so it's almost completely independent. However, the issue here is that every algorithm would need to be modified to work with the new format... If you (or anyone else) makes a PR, we can get a "base set" of features working, with the rest to come. :slightly_smiling_face: I'm very likely to jump in and help.

hameerabbasi avatar Mar 25 '19 18:03 hameerabbasi

For my purposes, which only need the slicing functionality of pydata/sparse, using a structured array as COO.data seems to be enough.

But it looks like arithmetic operations would need to be implemented for COO with np.recarray data.

import numpy as np
from sparse import COO

rank = 3
n_points = 100

for denserank in [1, 2, 5]:
    coords = np.random.randint(0, 2, (rank, n_points))  # duplicated coords
    coords2 = coords + 1
    data = np.random.randn(n_points, denserank).view('f8, ' * denserank).squeeze()
    try:
        spa1 = COO(coords, data, shape=(20, 20, 20))
        spa2 = COO(coords2, data, shape=(20, 20, 20))
    except Exception as e:
        print("has_duplicates=True failed: denserank={} {}".format(denserank, e))
        spa1 = COO(coords, data, shape=(20, 20, 20), has_duplicates=False)
        spa2 = COO(coords2, data, shape=(20, 20, 20), has_duplicates=False)

    print("spa1[1:]", spa1[1:])
    try:
        spa1 + spa2
    except Exception as e:
        print("spa1 + spa2 failed: denserank={} {}".format(denserank, e))

Haetsal-Lee avatar May 26 '20 09:05 Haetsal-Lee

I have a few questions regarding the potential implementation.

Format

@hameerabbasi can you elaborate on

I thought about this some more and it seems the most generic thing to do is to have blocks rather than dense dimensions.

Do you mean that every dimension would have a block size, which would be implemented as a separate array dimension in the data array? That seems more complicated than dense axes, because slicing such a blocked sparse axis would potentially break the block structure.
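
A small illustration of that slicing concern (a toy NumPy example, not pydata/sparse code): an offset that is not a multiple of the block size leaves the result misaligned with the original blocks.

```python
# Slicing with a non-multiple-of-block offset breaks block alignment:
# the sliced view's 2x2 tiles would straddle two blocks of the original array.
import numpy as np

block = 2
arr = np.arange(36).reshape(6, 6)    # imagine this stored as 2x2 blocks
view = arr[1:, :]                    # rows now start in the middle of a block
print(view.shape)                    # (5, 6): not even divisible into 2x2 tiles
```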

Class structure

What would be the best overall approach to implementing dense dimensions? It seems that dense dimensions/blocks belong to the base class because they would influence the shape calculations; is that correct?

Possible alternatives would be to modify one of the formats to support dense axes or to implement a mixin class.

Minimal functionality

What would be the minimal useful functionality set? It seems like per-axis sparse/dense conversion, slicing, tensordot, and elementwise ufunc would all be reasonable to consider.

akhmerov avatar Aug 18 '21 11:08 akhmerov