sparse
sparse copied to clipboard
Store linearized indices instead of full coordinates
I was thinking of replacing the full coordinates (coords
) with just a 1-D array of linearised indices.
Pros:
- More memory efficient (this gives back a lot of benefits of #158)
- Some O(N) operations become O(1) such as
reshape
,flatnonzero
. - COO becomes a special case of the GCRS/GCCS scheme, which means we can drop CSD and go with GCRS/GCCS, but with the difference that any collection of axes (not just rows/columns) can be compressed.
- CSR/CSC are represented exactly as in SciPy/other libraries (instead of having 2-D indices)
Cons:
- Adds a small (my intuition says almost negligible) performance penalty in many operations.
- Some O(1) operations become O(N) such as
nonzero
, the one-argument version ofwhere
.
cc @mrocklin, @stefanv
Would it be tenable to support both indexing schemes and give users the option to choose?
On Tue, Jul 3, 2018, 1:06 AM Hameer Abbasi [email protected] wrote:
I was thinking of replacing the full coordinates (coords) with just a 1-D array of linearised indices.
Pros:
- More memory efficient (this gives back a lot of benefits of #158 https://github.com/pydata/sparse/pull/158)
- Some O(N) operations become O(1) such as reshape, flatnonzero.
- COO becomes a special case of the GCRS/GCCS https://ieeexplore.ieee.org/document/7237032/ scheme, which means we can drop CSD and go with GCRS/GCCS, but with the difference that any collection of axes (not just rows/columns) can be compressed.
- CSR/CSC are represented exactly as in SciPy/other libraries
Cons:
- Adds a small (my intuition says almost negligible) performance penalty in many operations.
- Some O(1) operations become O(N) such as nonzero, the one-argument version of where.
cc @mrocklin https://github.com/mrocklin, @stefanv https://github.com/stefanv
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/pydata/sparse/issues/164, or mute the thread https://github.com/notifications/unsubscribe-auth/AAm20WTqGziuAMyE2EsEL0mqIflAKTYaks5uCyYOgaJpZM4VAeim .
Would it be tenable to support both indexing schemes and give users the option to choose?
COO
arrays will be indexed exactly as before. This is about storage, not indexing.
Yes but if there are performance tradeoffs then it might be nice to support both. Though maybe those tradeoffs are negligible?
Yes, the performance loss in most cases is negligible according to my intuition. But keeping both incurs technical debt, or at least maintenance debt.
FWIW, this is the approach I used for sparray: https://github.com/perimosocordiae/sparray/blob/master/sparray/flat.py#L20
I found it to be quite convenient, though it does mean you may need more integer bits to represent the flat indices.
@perimosocordiae Currently, we use np.intp
anyway, but due to the way things are designed, the maximum supported size is 2 ** 64 - 1
anyway, and I predict it might decrease to 2 ** 63 - 1
.
So it's a win for memory either way. In any case it seems like you're +1 on this. :-) Thanks a bunch for the feedback.