arkouda
arkouda copied to clipboard
Closes #3009: `indexof1d` to handle null values
This PR (closes #3009) refactors indexof1d
to use find
since they have similar functionality and find
is fairly optimized and correctly handles null values (once we dropna=False
to the Groupby
). The one major difference is when there are duplicates in the search space. find
will only return the index of the first occurrence, but indexof1d
returns the indices of all occurrences. To enable this, I added an all_occurrences
flag to find
The approach I took for returning all occurrences involved adding a segmented mink/maxk
, and I went back and forth on whether it should be user facing (see #3170). I implemented this by permuting the values and calling the existing mink/maxk
on each segment. It's unlikely that this is the most efficient approach, but my goal was to focus on correctness this go around and optimize later if needed.
Wrote tests for indexof1d
both for the reproducer and in general.
I'm especially interested in feedback from @brandon-neth to make sure I haven't missed any intended functionality of indexof1d
, since they're the original author.
based on how indexof1d
is used, see https://github.com/Bears-R-Us/arkouda/pull/3109#discussion_r1599881043, I think I should add a flag to find
which toggles whether indices that aren't found return -1
or are excluded from the output. Either way, I think we should hold off merging this until #3109 is merged so I can make sure those tests still pass. For those reasons, I'm converting this to draft for the time being
updated to add a remove_missing
flag to find
. @brandon-neth, I think I changed enough to warrant a rereview if you don't mind
I think this example is really good for understanding the broad strokes of what the changes to find
look like:
>>> select_from = ak.arange(10)
>>> arr1 = select_from[ak.randint(0, select_from.size, 20, seed=10)]
>>> arr2 = select_from[ak.randint(0, select_from.size, 20, seed=11)]
# remove some values to ensure we have some values
# which don't appear in the search space
>>> arr2 = arr2[arr2 != 9]
>>> arr2 = arr2[arr2 != 3]
# find with defaults (all_occurrences and remove_missing both False)
>>> ak.find(arr1, arr2)
array([-1 -1 -1 0 1 -1 -1 -1 2 -1 5 -1 8 -1 5 -1 -1 11 5 0])
# set remove_missing to True, only difference from default
# is missing values are excluded
>>> ak.find(arr1, arr2, remove_missing=True)
array([0 1 2 5 8 5 11 5 0])
# set all_occurrences to True, the first index of each list
# is the first occurence and should match the default
>>> ak.find(arr1, arr2, all_occurrences=True).to_list()
[[-1],
[-1],
[-1],
[0, 4],
[1, 3, 10],
[-1],
[-1],
[-1],
[2, 6, 12, 13],
[-1],
[5, 7],
[-1],
[8, 9, 14],
[-1],
[5, 7],
[-1],
[-1],
[11, 15],
[5, 7],
[0, 4]]
# set both remove_missing and all_occurrences to True, missing values
# will be empty segments
>>> ak.find(arr1, arr2, remove_missing=True, all_occurrences=True).to_list()
[[],
[],
[],
[0, 4],
[1, 3, 10],
[],
[],
[],
[2, 6, 12, 13],
[],
[5, 7],
[],
[8, 9, 14],
[],
[5, 7],
[],
[],
[11, 15],
[5, 7],
[0, 4]]
also find
was written by bill and was hard to understand even before this PR, so I recommend reading thru the comments and working through some examples and print out arrays as you go.