opteryx icon indicating copy to clipboard operation
opteryx copied to clipboard

✨ string equality experiment

Open joocer opened this issue 10 months ago β€’ 0 comments
trafficstars

Here’s an optimized Cython implementation using memcmp for byte-wise comparison instead of Python slicing. This should be even faster because it avoids unnecessary slicing operations and compares raw memory directly.


Updated Cython Code

Save this in string_match.pyx:

cimport cython from libc.stdint cimport int32_t, uint8_t from libc.string cimport memcmp import numpy cimport numpy as np

@cython.boundscheck(False) @cython.wraparound(False) def string_array_equals(uintptr_t offsets_ptr, uintptr_t data_ptr, Py_ssize_t size, bytes value): """ Compare each string in an Arrow array to a literal string using memcmp.

Parameters:
    offsets_ptr: uintptr_t
        Pointer to the Arrow string offset buffer.
    data_ptr: uintptr_t
        Pointer to the Arrow string data buffer.
    size: Py_ssize_t
        Number of elements in the array.
    value: bytes
        The string value to compare against.

Returns:
    np.ndarray
        A boolean mask where True indicates a match.
"""
cdef int32_t[:] offsets = <int32_t[:size + 1]> offsets_ptr  # Arrow offsets buffer
cdef uint8_t *data = <uint8_t *> data_ptr  # Arrow string data buffer
cdef np.ndarray[np.uint8_t, ndim=1] mask = numpy.zeros(size, dtype=numpy.uint8)

cdef Py_ssize_t i, start, end, length, value_len = len(value)
cdef uint8_t *value_ptr = <uint8_t *> value  # Get pointer to the literal string

for i in range(size):
    start = offsets[i]
    end = offsets[i + 1]
    length = end - start

    if length == value_len:
        # Use memcmp for efficient byte-wise comparison
        if memcmp(data + start, value_ptr, value_len) == 0:
            mask[i] = 1

return mask

Python Wrapper

Now, create a Python function that extracts Arrow buffers and calls the optimized Cython function.

import pyarrow as pa import numpy as np from string_match import string_array_equals # Import the compiled Cython function

def match_arrow_string_array(arr: pa.StringArray, value: str) -> np.ndarray: """ Compare each string in a PyArrow array to a literal string using memcmp.

Parameters:
    arr: pa.StringArray
        The PyArrow string array.
    value: str
        The string value to compare against.

Returns:
    np.ndarray
        A boolean mask where True indicates a match.
"""
buf_offsets = arr.buffers()[1]  # Get the offset buffer
buf_data = arr.buffers()[2]  # Get the string data buffer

size = len(arr)
mask = string_array_equals(buf_offsets.address, buf_data.address, size, value.encode('utf-8'))

return mask.astype(bool)  # Convert to boolean for easier usage

Testing the Function

Create a PyArrow string array

arr = pa.array(["apple", "banana", "cherry", "apple", "date"], type=pa.string())

Find matches against "apple"

mask = match_arrow_string_array(arr, "apple")

print(mask) # Expected Output: [ True, False, False, True, False ]


Why This is Faster

βœ… Uses memcmp: Byte-wise comparison is highly optimized in C. βœ… Avoids slicing: Instead of creating Python slices, we compare directly in memory. βœ… Zero-copy operation: No unnecessary data conversion or memory duplication. βœ… NumPy mask output: Easy to integrate into further vectorized operations.

Would you like to further optimize this for case-insensitive matching or batch comparisons against multiple values?

joocer avatar Jan 17 '25 17:01 joocer