hail icon indicating copy to clipboard operation
hail copied to clipboard

String split behavior inconsistent with Python

Open nawatts opened this issue 3 years ago • 4 comments

Hail version: 0.2.55-0d4ce0df2457

The behavior of the second parameter to StringExpression's split method is determined by its Java implementation and differs from Python string's split method.

The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded. -- https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#split-java.lang.String-int-

vs

If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no limit on the number of splits (all possible splits are made). -- https://docs.python.org/3/library/stdtypes.html#str.split

For n > 0, the resulting array has one less item than expected.

"a:b:c:d:e".split(":", 2)
# ['a', 'b', 'c:d:e']

hl.eval(hl.str("a:b:c:d:e").split(":", 2))
# ['a', 'b:c:d:e']

For n = 0, the resulting array has more than one item.

"a:b:c:d:e".split(":", 0)
# ['a:b:c:d:e']

hl.eval(hl.str("a:b:c:d:e").split(":", 0))
# ['a', 'b', 'c', 'd', 'e']

nawatts avatar Aug 31 '20 19:08 nawatts

I think we have this in our asana already, but good to have a GitHub issue too. What we can't immediately agree on is whether this is a fix or a breaking change.

tpoterba avatar Aug 31 '20 20:08 tpoterba

We could add a new function called like split_py that does the python thing, deprecate the old one, and swap in the future?

johnc1231 avatar Aug 31 '20 20:08 johnc1231

In case it's any help deciding on classification as a fix or breaking change, there was a past PR (#4741) for a similar issue that may provide precedent.

nawatts avatar Aug 31 '20 20:08 nawatts

Looking at our documentation, we document n as maximum number of splits. That makes this seem like a bug to me, especially in the 0 case. But clearly people use this function and this change is breaking to anyone who uses it. My vote would be to do the deprecation thing I suggested above.

johnc1231 avatar Aug 31 '20 20:08 johnc1231

Unearthing this old issue. I agree with John's assessment. We should:

  1. Fix the docs for n to indicate it is the maximum number of returned strings, not the maximum number of splits.
  2. Deprecate the n parameter. Do this with a warnings.warn when n is not None.
  3. Introduce a new parameter maxsplit which is the maximum number of splits to perform. Document that maxsplit is equal to the old n parameters minus one.

danking avatar Oct 23 '23 15:10 danking