arkouda icon indicating copy to clipboard operation
arkouda copied to clipboard

Improvements to groupby

Open reuster986 opened this issue 5 years ago • 10 comments

Desired functionality:

  • Ability to groupby multiple keys (requires multi-column argsort from #94)
  • get_group(key) function that returns the segment corresponding to a key
  • expand(vals) function that takes an array of one value per segment and returns a segmented array with each key's segment set to the corresponding value

reuster986 avatar Sep 06 '19 20:09 reuster986

@reuster986 is this issue solved now with the latest sort and groupie code?

mhmerrill avatar Sep 21 '19 16:09 mhmerrill

The first one is done, but I still need to do the other two.

reuster986 avatar Sep 21 '19 18:09 reuster986

I'm leaning towards calling the expand function broadcast. In NumPy, "broadcasting" means "taking a lower-dimensional object and replicating it to fill the shape of a higher-dimensional object". For example, NumPy lets you add an n-long column vector to an n x m matrix by replicating the vector m times into an n x m matrix where each row is constant. If you view a GroupBy as a sparse matrix (or tensor), the proposed expand function broadcasts a per-row constant to the non-zero elements of each row.

A broadcast method of the GroupBy class could look like:

def broadcast(self, vals):
    if not isinstance(vals, pdarray):
        raise ValueError("Vals must be pdarray")
    if vals.size != self.segments.size:
        raise ValueError("Must have one value per segment")
    temp = zeros(self.size, vals.dtype)
    if vals.size == 0:
        return temp
    diffs = ak.concatenate((ak.array([vals[0]]), vals[1:] - vals[:-1]))
    temp[self.segments] = diffs
    return cumsum(temp)

The performance would be somewhat improved by moving the logic over to the Chapel side.

reuster986 avatar Dec 23 '19 20:12 reuster986

The get_group() method should maybe just be the __getitem__() method of a GroupBy class, and it should do a gather of all the values in the indexed segments, returning a new GroupBy object.

reuster986 avatar Dec 23 '19 20:12 reuster986

A common usage pattern with GroupBy is to group by n levels to dedupe entries, and then group the resulting unique_keys by the first n-1 levels to get the desired GroupBy object. An example is constructing a graph, where one would first do u, v = GroupBy(src, dst).unique_keys to unique the edges, then csr = GroupBy(u) to form the compressed sparse row representation of the adjacency matrix. However, this procedure is inefficient because the second GroupBy() incurs a sort on already-sorted data. Three potential, non-exclusive solutions:

  • Expose find_segments so that users can call it directly on a (list of) sorted array(s)
  • Add checks to argsort and coargsort to return the identity permutation if the input is already sorted
  • Add an assume_sorted=False keyword argument to GroupBy.__init__ that, when True, bypasses the sorting step

reuster986 avatar Dec 23 '19 20:12 reuster986

there are isSorted checks already in place, they were added when I added the set operations.

https://github.com/mhmerrill/arkouda/blob/d9882d50f8853ec4f2a2e304daf7d1aba1decb09/src/RadixSortLSD.chpl#L115-L124 https://github.com/mhmerrill/arkouda/blob/d9882d50f8853ec4f2a2e304daf7d1aba1decb09/src/RadixSortLSD.chpl#L229-L238

mhmerrill avatar Dec 23 '19 21:12 mhmerrill

That's good, but I think that only applies to argsort, not coargsort.

reuster986 avatar Dec 23 '19 22:12 reuster986

That's good, but I think that only applies to argsort, not coargsort.

Any calls to sort ranks or keys will get checked. So I think coargsort is included.

mhmerrill avatar Dec 23 '19 23:12 mhmerrill

@reuster986 do we still need this issue open?, it not please close it.

mhmerrill avatar Apr 12 '21 20:04 mhmerrill

The second of the three points is not done yet.

reuster986 avatar Apr 12 '21 21:04 reuster986