ENH: np.unique: support hash based unique for string dtype
Description
This PR introduces hash-based uniqueness extraction support for NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.
Benchmark Results
The following benchmark demonstrates significant performance improvement from the new implementation.
The test scenario (a 1-billion-string array) follows the experimental setup described in https://github.com/numpy/numpy/pull/26018#issuecomment-2012427568
import random
import string
import time

import numpy as np
import polars as pl

chars = string.ascii_letters + string.digits
arr = np.array(
    [
        ''.join(random.choices(chars, k=random.randint(5, 10)))
        for _ in range(1_000)
    ] * 1_000_000,
    dtype='T',
)
np.random.shuffle(arr)

time_start = time.perf_counter()
print("unique count (hash based): ", len(np.unique(arr)))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)

time_start = time.perf_counter()
print("unique count (polars): ", len(pl.Series(arr).unique()))
time_elapsed = time.perf_counter() - time_start
print("%5.3f secs" % time_elapsed)
Result (the "numpy main" figure was obtained by running the same script against NumPy's main branch)
unique count (hash based): 1000
33.583 secs
unique count (numpy main): 1000
498.011 secs
unique count (polars): 1000
74.023 secs
Closes #28364
Everything is ready now!!
Thanks for looking at this! Ping @adrinjalali in case you have time/interest in thinking about this. I still need to dive into this more deeply, though, so below are only some initial thoughts.
For string/bytes (old style), it would seem reasonable to store a reference/pointer into the map and template the map's hash/equal functions appropriately (this can also be done with a custom type, as you did).
For StringDType we could also store the full packed string, but again operate on it in hash/equal with the existing dtype/allocator. However, at the end we will want to do a full copy (so that we don't keep referencing the old array's strings).
I.e. I am curious whether we can make the map more light-weight, both because I wonder if it may be simpler and because copying to std::string may be slower and is possibly wrong (embedded \0 characters).
It may well be that this thought only becomes useful if we refactor things first, though.
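The pointer-based idea above can be sketched roughly as follows. This is a minimal illustration, not NumPy's actual code: it assumes a contiguous buffer of fixed-width entries (the NPY_STRING case) and deduplicates by storing only pointers into the buffer, with hash/equal functors that read exactly itemsize bytes, so embedded \0 bytes are handled correctly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <unordered_set>

// Hypothetical sketch: hash and equality over fixed-width byte strings,
// keyed by pointer into the original array buffer (no copies).
struct FixedWidthHash {
    std::size_t itemsize;
    std::size_t operator()(const char *p) const {
        // Length-aware hash: embedded '\0' bytes do not terminate the key.
        return std::hash<std::string_view>{}(std::string_view(p, itemsize));
    }
};
struct FixedWidthEqual {
    std::size_t itemsize;
    bool operator()(const char *a, const char *b) const {
        return std::memcmp(a, b, itemsize) == 0;
    }
};

// Count distinct entries in a buffer of n fixed-width items.
std::size_t count_unique_fixed(const char *data, std::size_t n,
                               std::size_t itemsize) {
    std::unordered_set<const char *, FixedWidthHash, FixedWidthEqual> seen(
        n, FixedWidthHash{itemsize}, FixedWidthEqual{itemsize});
    for (std::size_t i = 0; i < n; ++i) {
        seen.insert(data + i * itemsize);  // store the pointer only
    }
    return seen.size();
}
```

Because the set only holds pointers, it stays valid only as long as the source array does, which is why a final copy of the unique values would still be needed.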
Thanks for your comments!
I agree that storing just pointers (or references) in the hash set sounds like a good idea—I’ll give that approach a try.
I also realized I hadn't fully considered the cases where strings include embedded \0 characters. Since handling embedded \0 correctly in C++'s std::string can be tricky, this is another strong reason to handle strings as pointers directly. I'll explore this further and update accordingly.
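The embedded-\0 pitfall mentioned here can be shown in a few lines (an illustrative sketch, not NumPy code): building a std::string from a bare char* stops at the first NUL byte, so two distinct fixed-width entries can collapse into the same key, while passing an explicit length preserves the full contents.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Constructing from a bare pointer treats the data as NUL-terminated,
// truncating at the first '\0'.
std::string truncated_key(const char *p) {
    return std::string(p);
}

// Passing the item size keeps embedded '\0' bytes in the key.
std::string full_key(const char *p, std::size_t itemsize) {
    return std::string(p, itemsize);
}
```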
I’ve updated the implementation to store pointers in the unordered_set.
In the original PR, I attempted to unify the implementations using a single templated unique function. However, considering the complexity of the code, I’ve split this into three separate functions: unique_integer, unique_string, and unique_vstring.
This change has improved the execution speed from the original PR’s 49 seconds down to 34 seconds!!
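For the variable-length (StringDType) side, the same pointer-storing idea can be sketched with std::string_view, which bundles a pointer and a length and already has length-aware hashing and comparison in the standard library. This is a simplified illustration under the assumption that each entry has been unpacked to a (pointer, length) view; it is not the PR's actual unique_vstring code.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <unordered_set>
#include <vector>

// Count distinct variable-length strings without copying their bytes:
// std::string_view hashes and compares by (pointer, length) contents,
// so embedded '\0' bytes are preserved.
std::size_t count_unique_views(const std::vector<std::string_view> &entries) {
    std::unordered_set<std::string_view> seen(entries.begin(), entries.end());
    return seen.size();
}
```

As with the fixed-width sketch, the views reference the original storage, so the unique results would need to be copied out before the source array can be released.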
Wow, cool!
Totally missed this until now. Will try to give it a careful look.
@seberg @ngoldbaum I've addressed all comments received so far.
Sorry this didn't quite make it in time for NumPy 2.3, but I'm going to try to get this merged ASAP so we have time to iterate in-tree as needed. I'll try to give this another once-over sometime this week.
I also want to say that this is a really impressive contribution. Any delay on my part is because I have to focus on other things and it takes a lot of time and attention to properly review CPython C API code, not because I'm unenthusiastic about getting this improvement into NumPy.
Also you might be interested in https://github.com/numpy/numpy/issues/29229
Thanks so much for the thorough review! This was my first PR to NumPy, and your detailed feedback really helped me bring it to completion.
I'll take a look at #29229 as you suggested — I'm also interested in #28363, which I think I can complete in a similar way to this PR.
Let's bring this in, thanks @math-hiyoko! Looking forward to future contributions.