hail
hail copied to clipboard
String split behavior inconsistent with Python
Hail version: 0.2.55-0d4ce0df2457
The behavior of the second parameter to StringExpression's split method is determined by its Java implementation and differs from Python string's split method.
The limit parameter controls the number of times the pattern is applied and therefore affects the length of the resulting array. If the limit n is greater than zero then the pattern will be applied at most n - 1 times, the array's length will be no greater than n, and the array's last entry will contain all input beyond the last matched delimiter. If n is non-positive then the pattern will be applied as many times as possible and the array can have any length. If n is zero then the pattern will be applied as many times as possible, the array can have any length, and trailing empty strings will be discarded. -- https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#split-java.lang.String-int-
vs
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements). If maxsplit is not specified or -1, then there is no limit on the number of splits (all possible splits are made). -- https://docs.python.org/3/library/stdtypes.html#str.split
For n
> 0, the resulting array has one less item than expected.
"a:b:c:d:e".split(":", 2)
# ['a', 'b', 'c:d:e']
hl.eval(hl.str("a:b:c:d:e").split(":", 2))
# ['a', 'b:c:d:e']
For n
= 0, the resulting array has more than one item.
"a:b:c:d:e".split(":", 0)
# ['a:b:c:d:e']
hl.eval(hl.str("a:b:c:d:e").split(":", 0))
# ['a', 'b', 'c', 'd', 'e']
I think we have this in our asana already, but good to have a GitHub issue too. What we can't immediately agree on is whether this is a fix or a breaking change.
We could add a new function called like split_py
that does the python thing, deprecate the old one, and swap in the future?
In case it's any help deciding on classification as a fix or breaking change, there was a past PR (#4741) for a similar issue that may provide precedent.
Looking at our documentation, we document n
as maximum number of splits
. That makes this seem like a bug to me, especially in the 0 case. But clearly people use this function and this change is breaking to anyone who uses it. My vote would be to do the deprecation thing I suggested above.
Unearthing this old issue. I agree with John's assessment. We should:
- Fix the docs for
n
to indicate it is the maximum number of returned strings, not the maximum number of splits. - Deprecate the
n
parameter. Do this with awarnings.warn
whenn
is notNone
. - Introduce a new parameter
maxsplit
which is the maximum number of splits to perform. Document thatmaxsplit
is equal to the oldn
parameters minus one.