sparse icon indicating copy to clipboard operation
sparse copied to clipboard

Store linearized indices instead of full coordinates

Open hameerabbasi opened this issue 6 years ago • 6 comments

I was thinking of replacing the full coordinates (coords) with just a 1-D array of linearised indices.

Pros:

  • More memory efficient (this gives back a lot of benefits of #158)
  • Some O(N) operations become O(1) such as reshape, flatnonzero.
  • COO becomes a special case of the GCRS/GCCS scheme, which means we can drop CSD and go with GCRS/GCCS, but with the difference that any collection of axes (not just rows/columns) can be compressed.
  • CSR/CSC are represented exactly as in SciPy/other libraries (instead of having 2-D indices)

Cons:

  • Adds a small (my intuition says almost negligible) performance penalty in many operations.
  • Some O(1) operations become O(N) such as nonzero, the one-argument version of where.

cc @mrocklin, @stefanv

hameerabbasi avatar Jul 03 '18 08:07 hameerabbasi

Would it be tenable to support both indexing schemes and give users the option to choose?

On Tue, Jul 3, 2018, 1:06 AM Hameer Abbasi [email protected] wrote:

I was thinking of replacing the full coordinates (coords) with just a 1-D array of linearised indices.

Pros:

  • More memory efficient (this gives back a lot of benefits of #158 https://github.com/pydata/sparse/pull/158)
  • Some O(N) operations become O(1) such as reshape, flatnonzero.
  • COO becomes a special case of the GCRS/GCCS https://ieeexplore.ieee.org/document/7237032/ scheme, which means we can drop CSD and go with GCRS/GCCS, but with the difference that any collection of axes (not just rows/columns) can be compressed.
  • CSR/CSC are represented exactly as in SciPy/other libraries

Cons:

  • Adds a small (my intuition says almost negligible) performance penalty in many operations.
  • Some O(1) operations become O(N) such as nonzero, the one-argument version of where.

cc @mrocklin https://github.com/mrocklin, @stefanv https://github.com/stefanv

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pydata/sparse/issues/164, or mute the thread https://github.com/notifications/unsubscribe-auth/AAm20WTqGziuAMyE2EsEL0mqIflAKTYaks5uCyYOgaJpZM4VAeim .

ahwillia avatar Jul 03 '18 15:07 ahwillia

Would it be tenable to support both indexing schemes and give users the option to choose?

COO arrays will be indexed exactly as before. This is about storage, not indexing.

hameerabbasi avatar Jul 03 '18 15:07 hameerabbasi

Yes but if there are performance tradeoffs then it might be nice to support both. Though maybe those tradeoffs are negligible?

ahwillia avatar Jul 03 '18 15:07 ahwillia

Yes, the performance loss in most cases is negligible according to my intuition. But keeping both incurs technical debt, or at least maintenance debt.

hameerabbasi avatar Jul 03 '18 16:07 hameerabbasi

FWIW, this is the approach I used for sparray: https://github.com/perimosocordiae/sparray/blob/master/sparray/flat.py#L20

I found it to be quite convenient, though it does mean you may need more integer bits to represent the flat indices.

perimosocordiae avatar Jul 06 '18 20:07 perimosocordiae

@perimosocordiae Currently, we use np.intp anyway, but due to the way things are designed, the maximum supported size is 2 ** 64 - 1 anyway, and I predict it might decrease to 2 ** 63 - 1.

So it's a win for memory either way. In any case it seems like you're +1 on this. :-) Thanks a bunch for the feedback.

hameerabbasi avatar Jul 06 '18 20:07 hameerabbasi