numpy icon indicating copy to clipboard operation
numpy copied to clipboard

ENH: np.unique: support hash based unique for string dtype

Open math-hiyoko opened this issue 8 months ago • 8 comments

Description

This PR introduces hash-based uniqueness extraction support for NPY_STRING, NPY_UNICODE, and NPY_VSTRING types in NumPy's np.unique function.
The existing hash-based unique implementation, previously limited to integer data types, has been generalized to accommodate additional data types including string-related ones. Minor refactoring was also performed to improve maintainability and readability.

Benchmark Results

The following benchmark demonstrates significant performance improvement from the new implementation.
The test scenario (1 billion strings array) follows the experimental setup described https://github.com/numpy/numpy/pull/26018#issuecomment-2012427568

import random
import string
import time

import numpy as np
import polars as pl

chars = string.ascii_letters + string.digits
arr = np.array(
    [
        ''.join(random.choices(chars, k=random.randint(5, 10)))
        for _ in range(1_000)
    ] * 1_000_000,
    dtype='T',
)
np.random.shuffle(arr)

time_start = time.perf_counter()
print("unique count (hash based): ", len(np.unique(arr)))
time_elapsed = (time.perf_counter() - time_start)
print ("%5.3f secs" % (time_elapsed))

time_start = time.perf_counter()
print("unique count (polars): ", len(pl.Series(arr).unique()))
time_elapsed = (time.perf_counter() - time_start)
print ("%5.3f secs" % (time_elapsed))

Result

unique count (hash based):  1000
33.583 secs
unique count (numpy main):  1000
498.011 secs
unique count (polars):  1000
74.023 secs

close #28364

math-hiyoko avatar Apr 18 '25 11:04 math-hiyoko

Everything is ready now!!

math-hiyoko avatar Apr 20 '25 12:04 math-hiyoko

Thanks for looking at this! Ping @adrinjalali in case you have time/interest thinking about this. I need to dive into this deeper, though. So below some thoughts only.

For string/bytes (old style), it would seem reasonable to store a reference/pointer into the map and template the hash/equal function of the map appropriately (can also be done with a custom type as you did though). For StringDType we also could store the full packed string, but operate on it again in hash/equal with the existing dtype/allocator. However, at the end we will want to do a full copy (to not keep referencing the old arrays strings).

I.e. I am curious if we can do the map more light-weight, both because i wonder if it may be simple, but also because copying to std::string is maybe slower and possibly wrong (embedded \0 characters). It may well be that this thought only becomes useful if we refactor things first, though.

seberg avatar Apr 22 '25 10:04 seberg

Thanks for your comments!

I agree that storing just pointers (or references) in the hash set sounds like a good idea—I’ll give that approach a try.

I also realized I hadn’t fully considered the cases where strings include embedded \0 characters. Since handling embedded \0 correctly in C++‘s std::string can be tricky, this is another strong reason to handle strings as pointers directly. I’ll explore this further and update accordingly.

math-hiyoko avatar Apr 22 '25 12:04 math-hiyoko

I’ve updated the implementation to store pointers in the unordered_set.

In the original PR, I attempted to unify the implementations using a single templated unique function. However, considering the complexity of the code, I’ve split this into three separate functions: unique_integer, unique_string, and unique_vstring.

This change has improved the execution speed from the original PR’s 49 seconds down to 34 seconds!!

math-hiyoko avatar Apr 24 '25 15:04 math-hiyoko

Wow, cool!

Totally missed this until now. Will try to give it a careful look.

ngoldbaum avatar Apr 24 '25 16:04 ngoldbaum

@seberg @ngoldbaum I've addressed all comments received so far.

math-hiyoko avatar May 04 '25 07:05 math-hiyoko

Sorry this didn't quite make it in time for NumPy 2.3, but i'm going to try to get this merged ASAP so we have time to iterate in-tree as needed. I'll try to give this another once-over sometime this week.

I also want to say that this is a really impressive contribution and any delay on my part is because I have to focus on other stuff and it takes a lot of time and attention to properly review CPython C API code rather than me being unenthusiastic about getting this improvement into NumPy.

ngoldbaum avatar Jun 09 '25 15:06 ngoldbaum

Also you might be interested in https://github.com/numpy/numpy/issues/29229

ngoldbaum avatar Jun 18 '25 19:06 ngoldbaum

Thanks so much for the thorough review! This was my first PR to NumPy, and your detailed feedback really helped me bring it to completion.

I'll take a look at #29229 as you suggested — I'm also interested in #28363, which I think I can complete in a similar way to this PR.

math-hiyoko avatar Jun 19 '25 17:06 math-hiyoko

Let's bring this in, thanks @math-hiyoko! Looking forward to future contributions.

ngoldbaum avatar Jun 20 '25 13:06 ngoldbaum