
Add advanced indexing support

Open woodmd opened this issue 6 years ago • 12 comments

I would like to have support for advanced indexing for both retrieval and assignment. Ideally I was hoping to find something that could serve as a drop-in replacement for numpy.ndarray for these types of operations. Is this functionality something that would be in the scope of this library? Are there any thoughts on how it should be implemented?

On a related note, I noticed that COO is currently immutable and thus doesn't allow item assignment. However, I wonder if one could support assignment by having COO make a copy of itself in place. Of course this would be extremely inefficient for updating a single element, but when addressing a large number of elements at once the overhead from the copy should be more manageable. The documentation could then stress that setting elements of COO individually is not recommended.

woodmd avatar Mar 01 '18 08:03 woodmd

Hi!

Advanced indexing support is definitely on the to-do list (see #1), however it would be extremely inefficient to assign with COO, as you noted.

It isn't a high priority right now, but a PR is always welcome. Also, since you have shown interest, the priority went up.

If you want fast writes, DOK is the way to go; you can then convert to COO once you are done writing. It supports assigning whole arrays at once, but the downside is that it doesn't support advanced indexing either. Unfortunately, DOK is implemented in Python (and so isn't super fast) and does not accept another COO/DOK as input, though it does accept NumPy arrays for slice assignment.
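For illustration, here is a minimal sketch of that workflow, assuming the DOK constructor takes a shape and that DOK.to_coo() performs the conversion (exact APIs may differ between versions):

import numpy as np
import sparse

# Build values up in DOK (fast hash-map writes), then snapshot to COO for
# everything else.
d = sparse.DOK((100, 100))
d[3, 7] = 1.5                             # single-element write
d[10:15, 0] = np.arange(5, dtype=float)   # NumPy array assigned to a slice
s = d.to_coo()                            # convert once writing is done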

If we want to make advanced indexing super-fast for both COO and DOK, we will need to re-write DOK in C++ on top of std::unordered_map. For COO, the data structure won't need to change, but a large part of the indexing would still need to move to C++.

If you want to interleave reads and writes, DOK is the best choice. However, currently it only supports accessing single elements, though that isn't too hard to extend.

Hope this helps!

Making COO itself mutable is out of scope, I'm afraid. It just isn't meant to support writes. Instead, in your scenario, I suggest you construct a DOK, mutate it, and take "snapshots" to COO whenever you want to perform operations other than writes.

hameerabbasi avatar Mar 01 '18 11:03 hameerabbasi

However, if assigning to COO won't change its sparsity structure, then that is in scope.

hameerabbasi avatar Mar 01 '18 11:03 hameerabbasi

I'm blocked on this by numba/numba#2560 or #126.

hameerabbasi avatar Apr 23 '18 02:04 hameerabbasi

cc @woodmd would it be good enough if we converted on-the-fly between the most optimal formats for each operation? So writing would automatically convert it to DOK, element-wise ops or reads to COO, for example?

hameerabbasi avatar Apr 28 '18 18:04 hameerabbasi

I don't think having element-wise update is critical for us. The use-case would generally be reading/updating a very large number of elements at once. For updating, probably the easiest approach would be to instantiate a new COO on the fly.
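A rough sketch of that idea, with made-up update values; note the assumption that the COO constructor combines duplicate coordinates by summing them, so this behaves as an accumulate rather than an overwrite:

import numpy as np
import sparse

s = sparse.random((100, 100), density=0.1)

# Batch of coordinates and values to fold in (hypothetical example data).
upd_coords = np.array([[0, 5, 7],    # row indices
                       [1, 2, 3]])   # column indices
upd_data = np.array([1.0, 2.0, 3.0])

# Instantiate a new COO on the fly from the concatenated coordinates/values.
coords = np.concatenate([s.coords, upd_coords], axis=1)
data = np.concatenate([s.data, upd_data])
s2 = sparse.COO(coords, data, shape=s.shape)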

woodmd avatar Apr 30 '18 16:04 woodmd

The following indexing pattern no longer works on master. I believe the problem originates at commit 222fb8c9ac0ed8cf73dbc6ac2d20dba3f2786fc7:

import sparse
import numpy as np
s = sparse.random((100, 100, 100), density=.1)
idx = np.random.permutation(100)
s[idx]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)

.../sparse/coo/indexing.py in getitem(x, index)
     54 
     55     # zip_longest so things like x[..., None] are picked up.
---> 56     if len(index) != 0 and all(ind == slice(0, dim, 1) for ind, dim in zip_longest(index, x.shape)):
     57         return x
     58 

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

ahwillia avatar May 05 '18 22:05 ahwillia

This is correct, it's a known issue, and it was discussed in the PR related to that commit.

We sacrificed this feature for performance. If you're interested, I can walk you through how to add full-blown advanced indexing.

I have no immediate plans to do this myself, however.

hameerabbasi avatar May 06 '18 05:05 hameerabbasi

I'd definitely be happy to help with this! (As long as you don't think it would take too much effort on your part to get me started on it.)

ahwillia avatar May 06 '18 20:05 ahwillia

Great! It is a bit of work to understand, but not too much work to actually implement. I'll say this right now: This isn't a cup of tea for a beginning contributor to this library, and it's completely okay if you're not okay doing this. That said, I'm more than willing to help guide you through it, in case you have questions or need any kind of support (algorithmic or technical). Don't hesitate to ask anything (either on this issue or on any related PR you create), and I'll help where I can.

Most of what you need is in https://github.com/pydata/sparse/blob/8f2a9aebe595762eace6bc48531119462f979e21/sparse/coo/indexing.py Here is a brief sketch of the indexing algorithm:

  • The main idea is to create a sort of "mask" such that output.data = input.data[mask]
  • Since indices are sorted in lexicographical order, it tries to find starts/ends for an integer slice via binary search (see the sketch after this list).
  • It does this for all supplied indices.
  • For slices, it repeats the above for every possible integer in the slice.
  • When it detects that the algorithm is becoming too slow (or if it reaches the ending index), it converts the starts/stops to a mask, and switches to filtering that mask using the slices directly.
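To make the mask idea concrete, here is a small standalone sketch (not the library's actual code) of locating the block of nonzeros matched by a single integer index on the first axis, relying on the lexicographic ordering of coords:

import numpy as np

def first_axis_mask(coords, i):
    # coords: (ndim, nnz) array sorted lexicographically, as in COO.coords.
    # All entries with coords[0] == i form one contiguous block, so two
    # binary searches give its start/stop; the mask is just that range.
    start = np.searchsorted(coords[0], i, side="left")
    stop = np.searchsorted(coords[0], i, side="right")
    return np.arange(start, stop)

coords = np.array([[0, 0, 1, 2, 2],
                   [1, 3, 0, 0, 2]])
data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
mask = first_axis_mask(coords, 2)
# data[mask] -> array([40., 50.]); coords[1:, mask] -> their remaining coordinates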

A couple of quick side notes:

  • All indices are represented as rows of an (n_indices, 3) array, with each row being a start/stop/step slice into input.coords (see the small example after this list).
  • Integers are represented as a slice, i:i+1:1
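As a small, made-up example of that representation, the index (3, 2:10:2) into a 2-D array would become a (2, 3) array of start/stop/step rows:

import numpy as np

normalized = np.array([[3, 4, 1],     # integer 3 -> slice 3:4:1
                       [2, 10, 2]])   # slice 2:10:2 unchanged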

What you will need to do (no pressure):

  • Refactor the algorithm to return output.coords and output.shape as well.
  • Broadcast advanced indices to a common shape.
  • Iterate over zip(advanced indices) and treat them as if they were single indices.
  • Combine all the coords and data masks.

A couple of things you might need to know during the implementation phase:

  • It's best to flatten all advanced coordinates, concatenate them into a single 2-D array, and work on it that way (see the sketch after this list). This way, there won't be a million different Numba-compiled versions for every dimension and number of advanced indices.
  • You might need to keep track of where the "advanced index dimension" goes. Specifically, if they're all next to each other, it goes at that place. If not, it goes at the beginning. https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#combining-advanced-and-basic-indexing.
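A minimal sketch of the broadcast-then-flatten step described above (hypothetical index values, plain NumPy only):

import numpy as np

# Two advanced indices into a 3-D array; their shapes broadcast to (2, 3).
idx0 = np.array([[0], [1]])                # shape (2, 1)
idx1 = np.array([[0, 2, 4]])               # shape (1, 3)

b0, b1 = np.broadcast_arrays(idx0, idx1)
adv_shape = b0.shape                       # contributes to output.shape
flat = np.stack([b0.ravel(), b1.ravel()])  # single 2-D (n_advanced, n_picks) array
# Each column of flat is then treated like a single integer index, and the
# results are reshaped back to adv_shape at the end.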

hameerabbasi avatar May 07 '18 07:05 hameerabbasi

I have added outer indexing for a single array in #172. If outer indexing for multiple arrays is needed then we might add it after NEP 21 is accepted.
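For reference, the kind of pattern #172 enables (a sketch; the result is assumed to match NumPy-style outer indexing along the first axis):

import numpy as np
import sparse

s = sparse.random((100, 100, 100), density=0.1)
idx = np.random.permutation(100)
out = s[idx]   # single integer-array index: axis 0 reordered according to idx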

hameerabbasi avatar Aug 06 '18 15:08 hameerabbasi

@ahwillia Seems like an older thread, but I was wondering if there's been any progress made on this front?

k-a-mendoza avatar Nov 08 '19 18:11 k-a-mendoza

#343 adds this for the one-dimensional case.

hameerabbasi avatar May 26 '20 09:05 hameerabbasi