Broadcasting string operations

Open Demirrr opened this issue 3 years ago • 4 comments

Description of new feature

Dear all,

Thank you for the great work. I was wondering whether broadcasting string operations would be useful for everyone, e.g.,

import awkward as ak
ar=ak.Array([['a','b','c'],['b','c'],['x','y','xyz']])
ar + 'a' # [['aa','ba','ca'],['ba','ca'],['xa','ya','xyza']]
ar - 'a' # [['','b','c'],['b','c'],['x','y','xyz']]
ar - 'x' # [['a','b','c'],['b','c'],['','y','yz']]

Currently, we observe the following ValueError:

ValueError: no overloads for custom types: add(string, string)

(https://github.com/scikit-hep/awkward-1.0/blob/1.7.0/src/awkward/_connect/_numpy.py#L259)

Demirrr, Jan 31 '22 10:01

Just as a note, this will involve adding a function to ak.behavior[np.add, "string", "string"]. Here's what it looks like to add a broken function to that behavior:

>>> import awkward as ak
>>> import numpy as np
>>> def broken(x, y):
...     raise Exception("broken!")
... 
>>> ak.behavior[np.add, "string", "string"] = broken

>>> ar + "a"
Traceback (most recent call last):
...
  File "<stdin>", line 2, in broken
Exception: broken!

Writing a non-broken function will involve three steps: removing the "__array__": "string" parameters from x and y, calling ak.concatenate on axis=-1 (because they're just arrays of lists of uint8 at that point), and repackaging the result with an "__array__": "string" parameter (to turn the concatenated lists of uint8 back into strings).
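As a rough illustration of those three steps, here is a pure-Python sketch that mimics them on nested lists. It does not use Awkward Array at all: `add_strings` is a hypothetical helper, and a real behavior function would work on whole arrays with ak.concatenate rather than looping over elements.

```python
# Pure-Python illustration only; add_strings is a hypothetical helper,
# not part of Awkward Array.

def add_strings(node, suffix):
    """Append `suffix` to every string in a nested list of strings."""
    if isinstance(node, str):
        # Step 1: drop the "string" interpretation, leaving a list of uint8.
        raw = list(node.encode("utf-8"))
        # Step 2: concatenate along the innermost axis (axis=-1).
        raw = raw + list(suffix.encode("utf-8"))
        # Step 3: repackage the uint8 values as a string.
        return bytes(raw).decode("utf-8")
    # Not a string yet: recurse one level deeper (this plays the role of broadcasting).
    return [add_strings(item, suffix) for item in node]


data = [["a", "b", "c"], ["b", "c"], ["x", "y", "xyz"]]
result = add_strings(data, "a")
```

With the array from the original request, `result` holds the strings with "a" appended, at the same nesting depth as the input.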

The addition case is straightforward because we already have a function for concatenation. But for subtraction? That's a pretty sophisticated function. It might be in line with what @martindurant is thinking about.

jpivarski, Feb 01 '22 01:02

I agree that the subtraction case may seem unintuitive and is open to interpretation. However, from my point of view, it is not very different from the addition case, as it can be defined via the following four rules:

1. f(A,B) = A (because B is not in A)
2. f(AB,B) = A (because B is in A)
3. f(B,A) = B if A is not in B (see rule 1)
4. f(ABA,A) = AB (see rule 2). However, one may argue that the result should be BA. It depends on the starting point.
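One possible reading of these four rules, sketched in plain Python: remove a single occurrence of the pattern, searching from the end by default so that f(ABA,A) = AB. The names `subtract`, `broadcast_subtract`, and the `from_end` flag are hypothetical and only serve to make the "starting point" choice explicit. Nothing here is an Awkward Array API.

```python
# Hypothetical helpers illustrating the four rules; not an Awkward Array API.

def subtract(text, pattern, from_end=True):
    """Remove one occurrence of `pattern` from `text`, if there is one."""
    if not pattern:
        return text
    # The "starting point" decides which occurrence is removed.
    index = text.rfind(pattern) if from_end else text.find(pattern)
    if index == -1:
        return text  # rules 1 and 3: pattern not present, nothing to remove
    return text[:index] + text[index + len(pattern):]  # rules 2 and 4


def broadcast_subtract(node, pattern, from_end=True):
    """Apply `subtract` to every string in a nested list of strings."""
    if isinstance(node, str):
        return subtract(node, pattern, from_end)
    return [broadcast_subtract(item, pattern, from_end) for item in node]


ar = [["a", "b", "c"], ["b", "c"], ["x", "y", "xyz"]]
minus_a = broadcast_subtract(ar, "a")
minus_x = broadcast_subtract(ar, "x")
```

Passing `from_end=False` gives the alternative answer for rule 4, f(ABA,A) = BA.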

Demirrr, Feb 01 '22 08:02

I am not entirely sure of the usefulness of arithmetic string operations: Python provides only equality, in (contains), addition (concatenation), and multiplication (repeated concatenation).
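For reference, a short sketch of those operators on built-in Python strings, including what happens when subtraction is attempted. The variable names are only for illustration.

```python
word = "ab"

same = word == "ab"    # equality
found = "a" in word    # containment
joined = word + "c"    # concatenation
repeated = word * 3    # repeated concatenation

# Subtraction is not defined for str, so Python raises a TypeError.
try:
    word - "b"
    subtraction_error = None
except TypeError as error:
    subtraction_error = error
```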

I actually wish for a pandas-like accessor model: a set of functions that explicitly need to be given string-behaviour array(s) to work on. Maybe it could be an accessor (arr.str.upper(), arr.str.index("needle"), ...) or a module, ak.str. The latter fits the layout of ak better; either way, we would have an explicit namespace of things that work on strings and rely on strings not being just a list of uint8. I was thinking of maybe coding it in Rust, which has native UTF-8 support.

martindurant, Feb 01 '22 14:02

> 4. f(ABA,A) = AB (see rule 2). However, one may argue that the result should be BA. It depends on the starting point.

I do not have any use cases for string subtraction in mind, so the following could very well not be too useful in practice: I could also see f(ABA,A) = B as a potential outcome, removing all matches. That would lead to the question whether f(ABABA, ABA) should be AB or BA though, depending on the starting point again.
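A small plain-Python sketch of this "remove all matches" variant shows where the ambiguity comes from: with overlapping matches, scanning from the left and scanning from the right remove different occurrences. `remove_all` and its `from_end` flag are hypothetical names used only for illustration.

```python
# Hypothetical helper; shows how the scan direction changes the result
# when matches overlap.

def remove_all(text, pattern, from_end=False):
    """Remove every non-overlapping occurrence of `pattern` from `text`."""
    if not pattern:
        return text
    if from_end:
        # Scanning right to left is the same as scanning the reversed
        # text left to right for the reversed pattern.
        return text[::-1].replace(pattern[::-1], "")[::-1]
    return text.replace(pattern, "")


no_overlap = remove_all("ABA", "A")                    # both A's removed
left_scan = remove_all("ABABA", "ABA")                 # first three characters removed
right_scan = remove_all("ABABA", "ABA", from_end=True) # last three characters removed
```

Here `no_overlap` is "B" regardless of direction, while `left_scan` is "BA" and `right_scan` is "AB".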

alexander-held, Feb 03 '22 07:02

These exist now: they're in the ak.str.* namespace (and implemented by pyarrow).

jpivarski, Jan 20 '24 01:01