opteryx
opteryx copied to clipboard
β¨ string equality experiment
Hereβs an optimized Cython implementation using memcmp for byte-wise comparison instead of Python slicing. This should be even faster because it avoids unnecessary slicing operations and compares raw memory directly.
Updated Cython Code
Save this in string_match.pyx:
cimport cython from libc.stdint cimport int32_t, uint8_t from libc.string cimport memcmp import numpy cimport numpy as np
@cython.boundscheck(False) @cython.wraparound(False) def string_array_equals(uintptr_t offsets_ptr, uintptr_t data_ptr, Py_ssize_t size, bytes value): """ Compare each string in an Arrow array to a literal string using memcmp.
Parameters:
offsets_ptr: uintptr_t
Pointer to the Arrow string offset buffer.
data_ptr: uintptr_t
Pointer to the Arrow string data buffer.
size: Py_ssize_t
Number of elements in the array.
value: bytes
The string value to compare against.
Returns:
np.ndarray
A boolean mask where True indicates a match.
"""
cdef int32_t[:] offsets = <int32_t[:size + 1]> offsets_ptr # Arrow offsets buffer
cdef uint8_t *data = <uint8_t *> data_ptr # Arrow string data buffer
cdef np.ndarray[np.uint8_t, ndim=1] mask = numpy.zeros(size, dtype=numpy.uint8)
cdef Py_ssize_t i, start, end, length, value_len = len(value)
cdef uint8_t *value_ptr = <uint8_t *> value # Get pointer to the literal string
for i in range(size):
start = offsets[i]
end = offsets[i + 1]
length = end - start
if length == value_len:
# Use memcmp for efficient byte-wise comparison
if memcmp(data + start, value_ptr, value_len) == 0:
mask[i] = 1
return mask
Python Wrapper
Now, create a Python function that extracts Arrow buffers and calls the optimized Cython function.
import pyarrow as pa import numpy as np from string_match import string_array_equals # Import the compiled Cython function
def match_arrow_string_array(arr: pa.StringArray, value: str) -> np.ndarray: """ Compare each string in a PyArrow array to a literal string using memcmp.
Parameters:
arr: pa.StringArray
The PyArrow string array.
value: str
The string value to compare against.
Returns:
np.ndarray
A boolean mask where True indicates a match.
"""
buf_offsets = arr.buffers()[1] # Get the offset buffer
buf_data = arr.buffers()[2] # Get the string data buffer
size = len(arr)
mask = string_array_equals(buf_offsets.address, buf_data.address, size, value.encode('utf-8'))
return mask.astype(bool) # Convert to boolean for easier usage
Testing the Function
Create a PyArrow string array
arr = pa.array(["apple", "banana", "cherry", "apple", "date"], type=pa.string())
Find matches against "apple"
mask = match_arrow_string_array(arr, "apple")
print(mask) # Expected Output: [ True, False, False, True, False ]
Why This is Faster
β Uses memcmp: Byte-wise comparison is highly optimized in C. β Avoids slicing: Instead of creating Python slices, we compare directly in memory. β Zero-copy operation: No unnecessary data conversion or memory duplication. β NumPy mask output: Easy to integrate into further vectorized operations.
Would you like to further optimize this for case-insensitive matching or batch comparisons against multiple values?