
SparseNDArray and SparseTensor.

Open JimClarke5 opened this issue 4 years ago • 2 comments

This may be a more appropriate issue for java-ndarray, but I thought I would kick off discussion here as it relates to SparseTensor.

@karl has asked that I explore setting up a SparseNDArray within java-ndarray, and I have done that partially, but I thought I would highlight some issues I am running up against.

I started writing a SparseNdArray under a new package (sparse) in java-ndarray. A sparse array contains a dense shape, the indices of coordinates as a LongNdArray, and a set of values stored as a U extends NdArray<T> array (e.g. FloatSparseNdArray would hold a FloatNdArray for its values). The indices contain the set of coordinates in the dense-shape space that match non-zero values in the hypothetical dense array the sparse array represents. The position of a coordinate within indices is also the index into the values array for the corresponding value entry.

The indices shape is [N, ndims], where N is the number of non-zero values and ndims is the number of dimensions in the dense shape. The values shape is [N], i.e. one entry per non-zero value.
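To make the layout concrete, here is a minimal sketch of the COO structure described above. All names here (CooLayout, numNonZero) are illustrative, not part of the actual java-ndarray API, and plain Java arrays stand in for the NdArray types:

```java
// Hypothetical COO layout: for a dense shape [3, 4] with N = 2 non-zero
// values, indices has shape [N, ndims] = [2, 2] and values has shape [N] = [2].
final class CooLayout {
    final long[] denseShape;   // e.g. [3, 4]
    final long[][] indices;    // [N][ndims], sorted in row-major (lexicographic) order
    final float[] values;      // [N]; values[i] sits at coordinate indices[i]

    CooLayout(long[] denseShape, long[][] indices, float[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must both have length N");
        }
        this.denseShape = denseShape;
        this.indices = indices;
        this.values = values;
    }

    long numNonZero() {
        return values.length;
    }
}
```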

It is straightforward to create an AbstractSparseNdArray class and a subclass, e.g. FloatSparseNdArray. It is also straightforward to get a value via getObject using full coordinates: first do a binary search of the sorted indices; if there is a match, return the corresponding value from the values ndarray, else return zero.
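The lookup described above could be sketched roughly like this (illustrative names, plain arrays instead of NdArray types):

```java
// Minimal sketch of the getObject lookup: binary-search the lexicographically
// sorted indices and return the matching value, or zero if absent.
final class CooLookup {
    // Compare two coordinate vectors in row-major (lexicographic) order.
    static int compare(long[] a, long[] b) {
        for (int i = 0; i < a.length; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    // indices: [N][ndims], sorted; values: [N]. Returns the stored value at
    // coords, or 0 when coords is not present (the implicit zero of the array).
    static float get(long[][] indices, float[] values, long[] coords) {
        int lo = 0, hi = indices.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compare(indices[mid], coords);
            if (c == 0) return values[mid];
            if (c < 0) lo = mid + 1; else hi = mid - 1;
        }
        return 0f;
    }
}
```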

setObject is a little more complicated. If a non-zero value changes to another non-zero value, just change the corresponding entry in the values array. However, if a value changes from zero to non-zero, one has to insert into both the indices and values arrays. Likewise, if a value changes from non-zero to zero, the corresponding entries in indices and values must be removed. On the surface this creates a lot of objects, and object creation has already been flagged as a problem in java-ndarray. (A workaround for the non-zero-to-zero case would be to simply set the entry in the values array to zero.)
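The three cases could look roughly like the sketch below (again, illustrative names and plain arrays, not the real API). Cases 2 and 3 rebuild both backing arrays, which is exactly where the object churn comes from:

```java
// Sketch of the three setObject cases on a sorted COO structure.
final class CooUpdate {
    long[][] indices; // [N][ndims], lexicographically sorted
    float[] values;   // [N]

    CooUpdate(long[][] indices, float[] values) {
        this.indices = indices;
        this.values = values;
    }

    private static int compare(long[] a, long[] b) {
        for (int i = 0; i < a.length; i++) {
            int c = Long.compare(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    // Binary search: match index if found, else -(insertionPoint) - 1.
    private int search(long[] coords) {
        int lo = 0, hi = indices.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compare(indices[mid], coords);
            if (c == 0) return mid;
            if (c < 0) lo = mid + 1; else hi = mid - 1;
        }
        return -lo - 1;
    }

    void set(long[] coords, float value) {
        int pos = search(coords);
        if (pos >= 0) {
            if (value != 0f) {                  // case 1: non-zero -> non-zero, in place
                values[pos] = value;
            } else {                            // case 3: non-zero -> zero, remove entry
                long[][] ni = new long[indices.length - 1][];
                float[] nv = new float[values.length - 1];
                System.arraycopy(indices, 0, ni, 0, pos);
                System.arraycopy(indices, pos + 1, ni, pos, ni.length - pos);
                System.arraycopy(values, 0, nv, 0, pos);
                System.arraycopy(values, pos + 1, nv, pos, nv.length - pos);
                indices = ni;
                values = nv;
            }
        } else if (value != 0f) {               // case 2: zero -> non-zero, insert entry
            int ins = -pos - 1;
            long[][] ni = new long[indices.length + 1][];
            float[] nv = new float[values.length + 1];
            System.arraycopy(indices, 0, ni, 0, ins);
            System.arraycopy(values, 0, nv, 0, ins);
            ni[ins] = coords.clone();
            nv[ins] = value;
            System.arraycopy(indices, ins, ni, ins + 1, indices.length - ins);
            System.arraycopy(values, ins, nv, ins + 1, values.length - ins);
            indices = ni;
            values = nv;
        } // setting an absent coordinate to zero is a no-op
    }
}
```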

Some other patterns in java-ndarray are very problematic.

For example,

  1. What does sparse.elements(dimensionIdx) mean? Is dimensionIdx equivalent to a row in the dense matrix? How do you iterate the elements?
  2. How do you slice the sparse array? Presumably, there would have to be a window into the sparse array. In the existing NDArray classes this is done via a DataBuffer and DataBufferWindow which are dense and do not apply to sparse.
  3. What does NdArray<T> get(long... coordinates) mean? How do you get an NDArray if you only provide partial coordinates? For partial coordinates, the concept of a higher dimension NDArray does not make sense unless it is converted to dense.
  4. Same issue with NdArray<T> set(NdArray<T> src, long... coordinates) (but one could walk the src array and pull out non-zero entries and make the internal modifications).

One option to deal with these issues is to make all or part of the sparse array dense when required, but doesn't that defeat the purpose of a sparse array?
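For reference, "making the sparse array dense" is just a scatter of the COO entries into a freshly allocated array. A minimal 2-D sketch (illustrative names only):

```java
// Scatter COO entries into a newly allocated dense 2-D array; every
// coordinate not listed in indices keeps the implicit zero.
final class CooDensify {
    static float[][] toDense(long[] denseShape, long[][] indices, float[] values) {
        float[][] dense = new float[(int) denseShape[0]][(int) denseShape[1]];
        for (int i = 0; i < values.length; i++) {
            dense[(int) indices[i][0]][(int) indices[i][1]] = values[i];
        }
        return dense;
    }
}
```

The cost is O(product of the dense shape) in memory, which is precisely what a sparse representation exists to avoid.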

In looking at how other packages deal with sparse matrices, the documentation for the analogous SciPy class, scipy.sparse.coo_matrix, states:

Disadvantages of the COO format
   does not directly support:
       arithmetic operations
       slicing

It is also noteworthy that the coo_matrix class is in the SciPy package and not in NumPy, so it is not constrained by NumPy array semantics.

My first question is: do we need all this complexity to represent a SparseTensor, or is there a simpler way to create it without dealing with all the ambiguities between sparse and dense in the java-ndarray package? This is what Python TF did with the SparseTensor class: the dense_shape, indices and values are Tensors that are passed to the tf.sparse low-level APIs, while the SparseTensor itself is passed to higher-level operations as a first-class Tensor.

I would appreciate any thoughts on this.

JimClarke5 avatar Jul 13 '21 20:07 JimClarke5

Thanks @JimClarke5 for looking at this,

When doing the setObject commands it becomes a little more complicated

I'm wondering whether our sparse ND arrays should simply be read-only. DataBuffers used to instantiate an NdArray can be marked as read-only and will throw an exception on any attempt to write to them (and I just noticed that this exception is documented only in the DataBuffer javadoc, not in NdArray). But (maybe just temporarily), sparse nd-arrays not backed by TensorFlow could be read-only, while their tensor implementation (backed by TF) could support write operations by calling TF sparse ops?

What does sparse.elements(dimensionIdx) mean? Is dimensionIdx equivalent to a row in the dense matrix? How do you iterate the elements?

It will iterate over all elements of the nth dimension, where n = dimensionIdx. For a rank-3 array, m.elements(0) will iterate through all matrices, m.elements(1) will iterate through all vectors and m.elements(2) will iterate through all scalars.

The iteration is simply done by incrementing the coordinates of the iterator on that dimension to access the next element.

For example, on m.elements(1) in the previous example, you will visit the vectors of the second dimension in this order: [0, 0, *], [0, 1, *], [1, 0, *], [1, 1, *]
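That visiting order can be sketched without the library at all: elements(dimensionIdx) amounts to enumerating every coordinate prefix of length dimensionIdx + 1 in row-major order. A small standalone sketch (illustrative names, not the java-ndarray API):

```java
import java.util.ArrayList;
import java.util.List;

// Enumerates the coordinate prefixes that elements(dimensionIdx) would visit.
// For shape [2, 2, 2] and dimensionIdx = 1 this yields [0,0], [0,1], [1,0], [1,1],
// i.e. the vectors [0,0,*], [0,1,*], [1,0,*], [1,1,*] in that order.
final class ElementsOrder {
    static List<long[]> elementPrefixes(long[] shape, int dimensionIdx) {
        List<long[]> out = new ArrayList<>();
        collect(shape, new long[dimensionIdx + 1], 0, out);
        return out;
    }

    private static void collect(long[] shape, long[] coords, int dim, List<long[]> out) {
        if (dim == coords.length) {
            out.add(coords.clone());
            return;
        }
        for (long i = 0; i < shape[dim]; i++) {
            coords[dim] = i;
            collect(shape, coords, dim + 1, out);
        }
    }
}
```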

How do you slice the sparse array?

I don't know how simple it is to do. Off the top of my head, instead of using sliding windows we could offset the indices by the coordinates of the start of the slice: e.g. if you have a rank-3 sparse array and you slice it to its second matrix, the offset to apply to indices before returning an element would be [1, 0, 0]. Or simply create (sub)copies of indices and values, i.e. filter the indices matching [1, x, y] and use [x, y] as the new slice indices, and so on. I haven't been deeply through it though, so maybe I'm missing something.

We can also disable slicing for non-tensor sparse ndarrays, but if we do so, iterating through elements won't work.

What does NdArray<T> get(long... coordinates) mean?

It returns the n-dimensional elements at these coordinates. For a rank-3 array, m.get(0, 1) will return the second vector of the first matrix. So basically, it is slicing on a single point of each indexed dimension.

Same issue with NdArray<T> set(NdArray<T> src, long... coordinates)

Again I would opt for simplicity and only support read operations on non-tensor sparse arrays, wdyt?

karllessard avatar Jul 14 '21 14:07 karllessard

We punted on this issue in Tribuo, so our sparse vectors and matrices have a fixed index structure, but the values can change. That's turned out to work quite well for some things, but it can be a confusing restriction. And exploiting that restriction makes it harder to develop in the future.

I think making the sparse tensor read only from Java and allowing the TF graph to manipulate it if necessary seems like a reasonable compromise.

Craigacp avatar Jul 14 '21 16:07 Craigacp