Access non-zero indices of an array
It is useful (especially with sparse matrices) to know which indices are non-zero.
Proposal to add a new API function non-zero-indices which will return this information for any 1D+ array.
This operation needs to be efficiently implemented by underlying sparse matrix types.
For :vectorz at least, the following API makes sense and would most efficient:
(non-zero-indices M)
=> [(int-array [1 2 5 10 100])
(int-array [50 99])
(int-array [16 18 88])]
i.e. the function returns a sequence of non-zero indexes for each row.
In general, we can't control the exact types of index lists produced by the underlying implementations, so we might need to be clever about how we support these. Possibly provide an IPersistentVector wrapper for the underlying index types?
Ok, I think we may want to go ahead and have a non-zeros function instead that would return both the indices and values.
(def M (mtx/matrix [[0 2 1][0 1 0]]))
=> #'user/M
(non-zeros M)
=> {[0 1] 2, [0 2] 1, [1 1] 1}
My use cases are:
- modifying the value of all non-zero elements, and
- using non-zero indices to reference an external index.
The first is an optimization and convenience. We could of course map a function over the elements that returns a zero if the element is zero. The second is mostly convenience but benefits from an efficient implementation. I write clustering and topic models code and I'd like to easily get non-zero indices so I can use them to lookup feature names store in a separate collection with the same indices.
Please let me know if this sounds like the right direction. To implement this we'd add nonZeros to INDArray in vectorz and add a non-zeros function to a PNonZeros protocol?
A naive implementation using some nice core.matrix functions already implemented.
(defn non-zeros
"Gets the non-zero indices of an array mapped to the values."
([m] (into {}
(filter (comp not zero? second)
(map vector (index-seq m) (eseq m))))))
My main use case at this point would be with core.matrix vectors. I'd just want a sequence of indices. That would fit with Mike's proposal. However, my vectors and matrices are rather small (a few hundred on a side) and therefore I don't bother with sparse matrices. Easy enough to do it myself. Maybe sparse matrices are a more important use case for this functionality. -Marshall
@mattrepl Can you be a bit more specific about your use cases?
For example: If you just want to modify non-zero values, then perhaps an emap-non-zero function would be better. It would certainly be far more efficient than anything that needs to construct a big map of index pairs to values (constructing lots of objects is always going to be expensive because of GC issues). Even a simple (emap #(if (not (== 0 %)) .....) M) would probably be quicker (unless your matrix is extremely sparse).
non-zeros is OK as a name, but maybe non-zero-map would be better to make the function call more self-explanatory? This would be consistent with non-zero-count which I have recently added, and we could also have non-zero-indices to return an array of indices in a manner similar to the API I outlined above. Does that make sense?
I'm not sure what details about my use cases will help, but I'll expand on the example for the second case.
Agreed re: emap-non-zero. That should be more efficient, especially if there is a version for mutable matrices. That covers the first use case. Colt does this too.
The second use case involves using matrix indices as indices into other data structures. Say I have a large, sparse matrix representing frequency counts of terms in documents. Each row corresponds to a document. I also have an indexable collection of term names. For each document, I'd like to find the five most frequent terms. The non-zeros-map function returns a map that I can easily sort by value. I could then group the sorted map entries by row-index and take the first 5 entries for each row-index. Finally, I'd use the column indices from the entries to lookup the term's name in the term name collection.
Other specific examples of this use case are the serialization of a matrix into a coordinate text file or a graph adjacency list.
The use cases are then:
- efficient modification of sparse matrices when skipping zero elements
- getting non-zero indices and values for a purpose other than modifying the matrix
Sometimes only the non-zero indices may be needed, but I don't think including the non-zero values adds sufficient overhead to warrant having both non-zero-indices and non-zero-map. If a particular use only requires indices, they're only a (keys (non-zero-map m)) away.
Cool, thanks for the use cases. I agree that non-zero-map makes a lot of sense here. Sounds like a good one to include.
I think we will want non-zero-indices as well though. It's going to be much more efficient for many other use cases.
(keys (non-zero-map)) will be moderately expensive (I guess about 200ns per non-zero element?). It will need to box values that just get thrown away for example, and map construction itself isn't particularly cheap (much more expensive than just producing an int[] array....)
@mars0i how do you use the non-zero indices?
It turns out that I don't seem to be pulling out all nonzero indices in any code right now. I think I did that in an earlier version of my project. Sorry about saying that I would use nonzero indices. I still think it's the sort of thing that could be useful to me at some point. I'm implementing neural nets with matrices. I gradually "add nodes" to the network by changing zeros to ones in a "mask" vector. Whether a node can be added depends on what other nodes exist, though, so the code could iterate through nonzero indices in the mask vector to see which nodes already exist. At present I do this work in advance by associating with each node a sequence containing just those indices that need to be checked in a particular situation.
I may or may not use this in code, but I'm finding that pulling out nonzero indices is useful for testing at the repl. e.g. when I add a new function that adds certain nonzero values to a sparse matrix that's too large to look at all at once, I want to verify that it's happening correctly. I guess the interface doesn't matter very much to me, though, at this point; I just want to be able to see what comes out.
I'm finding working with the current non-zero-indices to be not useful for my use cases stated above. I ultimately want expanded indices over all dimensions but instead get a seq of vectors of indices on a single dimension.
(def m (matrix [[1 0 3][2 4 0]]))
(non-zero-indices m)
;; Currently produces: [[0 2] [0 1]]
;; rather than: [[0 0] [0 2] [1 0] [1 1]]
I realize it's not hard to transform the current result into the expaned form, but is there a reason non-zero-indices shouldn't do this for the user?
There's an efficiency reason: it is much cheaper to return indices in that form. For a L_M_N matrix you only need to allocate O(L_M) vectors instead of O(L_M*N).
Also it may be a more convenient form depending on precisely what you are doing: I've found this form more useful for doing row-wise computations for example.
Don't personally have a strong view though. Happy to consider alternative API suggestions.
I don't disagree with your stated reasons, but the indices in that form are only immediately useful if you are processing rows. Otherwise additional work is done by the caller to recover the outer index and I'm not sure the most common use cases for non-zero-indices are row-oriented.
Would be helpful to hear more opinions. I'm fine with keeping a few helper functions around for conversion if others prefer the current version.
I'm quite keen that the underlying protocols use the most efficient versions but certainly we can support different API / helper functions built on top - based on whatever is most helpful to users.
More on my use cases at this point: I find that I want something like non-zero-indices for examining matrices as well as vectors. Again, so far I would just use this for testing and experimentation, but I think that may actually be an important case. One of the benefits of Clojure is the ability to play around with code in the repl. But when a matrices get beyond a certain size, it's difficult to experiment with them at the repl, because you can't just let them display--you've got to write code to see what's in your matrix. In cases in which the matrices are sparse, though, something like non-zero-indices lets you quickly see what's in a matrix at the repl. That seems very valuable to me.
For this use, mattrepl's output format for non-zeros, non-zero-map given above is much more convenient than mikera's non-zero-indices format. However, the function also doesn't need to be extremely efficient for use at the repl. So mikera's proposal to use the indexes-by-rows format, with a wrapper to provide some version of mattrepl's index-vectors-plus-values output format would work fine for use case I'm describing.
(mattrepl, thanks for the nice little implementation of non-zeros (non-zero-map) above. That's going to be very useful to me in the next couple of weeks.)