arkouda
Improvements to groupby
Desired functionality:
- Ability to group by multiple keys (requires multi-column argsort from #94)
- A `get_group(key)` function that returns the segment corresponding to a key
- An `expand(vals)` function that takes an array with one value per segment and returns a segmented array in which each key's segment is set to the corresponding value
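To make the desired `expand` semantics concrete, here is a minimal plain-NumPy sketch. The `segments` layout and values are hypothetical, chosen only for illustration; arkouda's actual `GroupBy` stores segment start offsets the same way:

```python
import numpy as np

# Hypothetical grouped layout: values lie contiguously, and
# segments holds the start offset of each group.
segments = np.array([0, 3, 5])   # three groups: [0:3), [3:5), [5:7)
size = 7
vals = np.array([10, 20, 30])    # one value per segment

# Naive expand: fill each segment with its per-segment value.
expanded = np.empty(size, dtype=vals.dtype)
for i, start in enumerate(segments):
    end = segments[i + 1] if i + 1 < len(segments) else size
    expanded[start:end] = vals[i]

print(expanded)  # [10 10 10 20 20 30 30]
```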
@reuster986 is this issue solved now with the latest sort and groupby code?
The first one is done, but I still need to do the other two.
I'm leaning towards calling the `expand` function `broadcast`. In NumPy, "broadcasting" means taking a lower-dimensional object and replicating it to fill the shape of a higher-dimensional object. For example, NumPy lets you add an `n`-long column vector to an `n x m` matrix by replicating the vector `m` times into an `n x m` matrix where each row is constant. If you view a GroupBy as a sparse matrix (or tensor), the proposed `expand` function broadcasts a per-row constant to the non-zero elements of each row.
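The NumPy behavior described above can be demonstrated directly (shapes chosen arbitrarily for illustration):

```python
import numpy as np

# NumPy "broadcasting": an n-long column vector added to an n x m
# matrix is implicitly replicated across the m columns.
n, m = 3, 4
col = np.arange(n).reshape(n, 1)  # shape (3, 1): [[0], [1], [2]]
mat = np.ones((n, m))             # shape (3, 4)

result = col + mat                # col is broadcast to shape (3, 4)
print(result.shape)               # (3, 4)
print(result[2, 0])               # 3.0  (2 + 1)
```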
A `broadcast` method of the `GroupBy` class could look like:
```python
def broadcast(self, vals):
    if not isinstance(vals, ak.pdarray):
        raise ValueError("vals must be a pdarray")
    if vals.size != self.segments.size:
        raise ValueError("Must have one value per segment")
    temp = ak.zeros(self.size, vals.dtype)
    if vals.size == 0:
        return temp
    # Place first-differences at each segment start, then prefix-sum
    # so every element of a segment ends up equal to that segment's value.
    diffs = ak.concatenate((ak.array([vals[0]]), vals[1:] - vals[:-1]))
    temp[self.segments] = diffs
    return ak.cumsum(temp)
```
The performance would be somewhat improved by moving the logic over to the Chapel side.
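The diff-and-cumsum trick above can be checked in plain NumPy; the segment layout here is hypothetical, and `np.diff`/`np.cumsum` stand in for the arkouda equivalents:

```python
import numpy as np

# Hypothetical grouped layout: segments holds group start offsets.
segments = np.array([0, 3, 5])
size = 7
vals = np.array([10, 20, 30])    # one value per segment

# Scatter first-differences to segment starts, then prefix-sum:
# each segment accumulates back to its own value.
diffs = np.concatenate(([vals[0]], np.diff(vals)))  # [10, 10, 10]
temp = np.zeros(size, dtype=vals.dtype)
temp[segments] = diffs           # [10, 0, 0, 10, 0, 10, 0]
out = np.cumsum(temp)

print(out)  # [10 10 10 20 20 30 30]
```

Because scatter and prefix-sum are both single vectorized passes, this avoids any per-segment loop, which is what makes the trick attractive before porting it to Chapel.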
The `get_group()` method should maybe just be the `__getitem__()` method of the `GroupBy` class, and it should do a gather of all the values in the indexed segments, returning a new `GroupBy` object.
A common usage pattern with `GroupBy` is to group by `n` levels to dedupe entries, and then group the resulting `unique_keys` by the first `n-1` levels to get the desired `GroupBy` object. An example is constructing a graph, where one would first do `u, v = GroupBy(src, dst).unique_keys` to unique the edges, then `csr = GroupBy(u)` to form the compressed sparse row representation of the adjacency matrix. However, this procedure is inefficient because the second `GroupBy()` incurs a sort on already-sorted data. Three potential, non-exclusive solutions:
- Expose `find_segments` so that users can call it directly on a (list of) sorted array(s)
- Add checks to `argsort` and `coargsort` to return the identity permutation if the input is already sorted
- Add an `assume_sorted=False` keyword argument to `GroupBy.__init__` that, when `True`, bypasses the sorting step
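The second option can be sketched in plain NumPy; `argsort_with_check` is a hypothetical helper for illustration, not the arkouda implementation:

```python
import numpy as np

# Sketch of an argsort that short-circuits to the identity
# permutation when the input is already sorted, skipping the sort.
def argsort_with_check(a):
    if a.size <= 1 or np.all(a[:-1] <= a[1:]):  # already sorted?
        return np.arange(a.size)                 # identity permutation
    return np.argsort(a)

print(argsort_with_check(np.array([1, 2, 2, 5])))  # [0 1 2 3]
print(argsort_with_check(np.array([3, 1, 2])))     # [1 2 0]
```

The sortedness check is a single O(n) pass, so it is cheap relative to the sort it can avoid.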
There are `isSorted` checks already in place; they were added when I added the set operations.
https://github.com/mhmerrill/arkouda/blob/d9882d50f8853ec4f2a2e304daf7d1aba1decb09/src/RadixSortLSD.chpl#L115-L124 https://github.com/mhmerrill/arkouda/blob/d9882d50f8853ec4f2a2e304daf7d1aba1decb09/src/RadixSortLSD.chpl#L229-L238
That's good, but I think that only applies to `argsort`, not `coargsort`.
> That's good, but I think that only applies to `argsort`, not `coargsort`.
Any calls to sort ranks or keys will get checked. So I think coargsort is included.
@reuster986 do we still need this issue open? If not, please close it.
The second of the three points is not done yet.