arkouda icon indicating copy to clipboard operation
arkouda copied to clipboard

Closes #3009: `indexof1d` to handle null values

Open stress-tess opened this issue 9 months ago • 3 comments

This PR (closes #3009) refactors indexof1d to use find since they have similar functionality and find is fairly optimized and correctly handles null values (once we dropna=False to the Groupby). The one major difference is when there are duplicates in the search space. find will only return the index of the first occurrence, but indexof1d returns the indices of all occurrences. To enable this, I added an all_occurrences flag to find

The approach I took for returning all occurrences involved adding a segmented mink/maxk, and I went back and forth on whether it should be user facing (see #3170). I implemented this by permuting the values and calling the existing mink/maxk on each segment. It's unlikely that this is the most efficient approach, but my goal was to focus on correctness this go around and optimize later if needed.

Wrote tests for indexof1d both for the reproducer and in general.

I'm especially interested in feedback from @brandon-neth to make sure I haven't missed any intended functionality of indexof1d, since they're the original author.

stress-tess avatar May 09 '24 13:05 stress-tess

based on how indexof1d is used, see https://github.com/Bears-R-Us/arkouda/pull/3109#discussion_r1599881043, I think I should add a flag to find which toggles whether indices that aren't found return -1 or are excluded from the output. Either way, I think we should hold off merging this until #3109 is merged so I can make sure those tests still pass. For those reasons, I'm converting this to draft for the time being

stress-tess avatar May 14 '24 12:05 stress-tess

updated to add a remove_missing flag to find. @brandon-neth, I think I changed enough to warrant a rereview if you don't mind

stress-tess avatar May 17 '24 17:05 stress-tess

I think this example is really good for understanding the broad strokes of what the changes to find look like:

>>> select_from = ak.arange(10)
>>> arr1 = select_from[ak.randint(0, select_from.size, 20, seed=10)]
>>> arr2 = select_from[ak.randint(0, select_from.size, 20, seed=11)]
# remove some values to ensure we have some values
# which don't appear in the search space
>>> arr2 = arr2[arr2 != 9]
>>> arr2 = arr2[arr2 != 3]
# find with defaults (all_occurrences and remove_missing both False)
>>> ak.find(arr1, arr2)
array([-1 -1 -1 0 1 -1 -1 -1 2 -1 5 -1 8 -1 5 -1 -1 11 5 0])
 # set remove_missing to True, only difference from default
 # is missing values are excluded
 >>> ak.find(arr1, arr2, remove_missing=True)
array([0 1 2 5 8 5 11 5 0])
# set all_occurrences to True, the first index of each list
# is the first occurence and should match the default
>>> ak.find(arr1, arr2, all_occurrences=True).to_list()
[[-1],
 [-1],
 [-1],
 [0, 4],
 [1, 3, 10],
 [-1],
 [-1],
 [-1],
 [2, 6, 12, 13],
 [-1],
 [5, 7],
 [-1],
 [8, 9, 14],
 [-1],
 [5, 7],
 [-1],
 [-1],
 [11, 15],
 [5, 7],
 [0, 4]]
# set both remove_missing and all_occurrences to True, missing values
# will be empty segments
>>> ak.find(arr1, arr2, remove_missing=True, all_occurrences=True).to_list()
[[],
 [],
 [],
 [0, 4],
 [1, 3, 10],
 [],
 [],
 [],
 [2, 6, 12, 13],
 [],
 [5, 7],
 [],
 [8, 9, 14],
 [],
 [5, 7],
 [],
 [],
 [11, 15],
 [5, 7],
 [0, 4]]

also find was written by bill and was hard to understand even before this PR, so I recommend reading thru the comments and working through some examples and print out arrays as you go.

stress-tess avatar May 20 '24 18:05 stress-tess