julia `split` handles empty fields inconsistently

Consider:

julia> print(split("spaces    here", limit=2))
SubString{String}["spaces", "   here"]

Is this result correct, or a bug? It's not clear. On one hand, there is no direct, unambiguous violation of the docstring here. It does indeed split on isspace, and does not keep any empty fields.

On the other hand, it's highly unintuitive, and clearly not what the user meant. Furthermore, it's inconsistent with this result:

julia> print(split("spaces    here", limit=3))
SubString{String}["spaces", "here"]

Bumping the limit from 2 to 3 where there are only two fields should not change behaviour.

I suggest fixing this by changing the behaviour of EachSplitIterator, so that it does not finish early if keepempty is false.

Edit: After some thought, I'd argue this is buggy behaviour. The docstring states that it splits on the occurrences of dlm, which clearly means all occurrences of dlm. The current implementation fails to split on any trailing occurrences, e.g.

julia> split("X   ", limit=1)
1-element Vector{SubString{String}}:
 "X   "

is wrong.

Jul 03 '22 16:07 jakobnissen

I can make a PR if it's agreed that current behaviour is a bug, but since current behaviour is explicitly tested for, I guess some people think it's correct behaviour?

Jul 03 '22 16:07 jakobnissen

The parameter is to limit the output to limit elements, not limit splits, so split needs to do limit-1 non-empty splits, because the last element has to be reserved for the remainder of the string after the last split.

So the example with limit=2 does one split and stops which leaves the last element of the result with one less space at its start. The default delimiter is a single "space", not a sequence of spaces.

But the example with limit=3 tries to do two splits so it consumes all the empty fields delimited by all those spaces between the non-empty ones, and has no remainder to put in the third element so it is omitted.

So the difference between limit=2/3 is expected.

Its not documented if empty fields following limit-1 splits (ie at the start of the remainder) will be removed.

Personally I would have expected not, it allows another split (possibly with different parameters, eg keep=true) to be used on the remainder of the string.

Jul 04 '22 01:07 elextr

If the example with limit=3 will split indefinitely, then it is still buggy because it violates your assumption that it will do limit-1 split

Jul 04 '22 05:07 jakobnissen

your assumption that it will do limit-1 split

Thats not my assumption, and why I started the response with "The parameter is to limit the output to limit elements, not limit splits".

[Edit: or another way to say it is it will do infinite empty splits, but limit-1 non-empty splits]

Jul 04 '22 08:07 elextr

The parameter is to limit the output to limit elements, not limit splits, so split needs to do limit-1 non-empty splits

Indeed. And would you not agree that it is implicit that split will do as many splits as possible, while still staying below limit elements? I.e. if split("a b c", limit=5) resulted in ["a", "b c"], that would be a bug, even if technically, the docstring was not violated (it did split on space, and resulted in no more than 5 elements)?

So if we can agree that split ought to do all splits until it would exceed limit, then in fact split("spaces here", limit=2) ought to keep splitting until it has two kept elements, i.e. it ought to return ["spaces", "here"]. Otherwise it makes the same mistake as my previous example, namely that it stops prematurely, as it would have been able to just continue splitting without exceeding the limit of 2 elements.

Jul 04 '22 08:07 jakobnissen

So if we can agree that split ought to do all splits until it would exceed limit, then in fact split("spaces here", limit=2) ought to keep splitting until it has two kept elements, i.e. it ought to return ["spaces", "here"]. Otherwise it makes the same mistake as my previous example, namely that it stops prematurely, as it would have been able to just continue splitting without exceeding the limit of 2 elements.

Agree, but for limit=2 one non-empty split gives two elements, so there is no reason to do more splits.

Jul 04 '22 08:07 elextr

Ah, so I guess we disagree on whether split should try to do as FEW splits as possible to obtain limit elements (what I take it you believe), or as MANY splits as possible, not exceeding limit elements (what I believe). I would argue when one is told that split splits on the occurrences of dlm, it is implied that it splits on all occurrences from left to right (unless limit would be exceeded). That is, it tries to do the same as if limit had not been passed as a keyword, except it stops if it would otherwise exceed limit. That seems the most unsurprising behaviour in my opinion

At any rate, this should be decided on and documented, since it's currently ambiguous. I would also argue the current behaviour is unintuitive, and it has the undesirable behaviour that split(x, limit=2) can be different from split(x, limit=3), even when both return 2 elements.

Jul 04 '22 08:07 jakobnissen

Yes, you have explained the difference between the current behaviour and your feature request correctly.

What would you expect:

print(split("spaces    here    and    here   ", limit=3))

to produce?

Jul 04 '22 08:07 elextr

This is not a feature request, we're arguing if it's a bug - or at least an inconsistent, surprising behaviour in an unspecified edge case that should be changed. If it was a feature, it'd just be a breaking change. I would expect split("spaces here and here ", limit=3) to result in ["spaces", "here", "and here "], i.e:

It splits all spaces before "and", as expected
It does not split after "and", since doing so would exceed limit
It does not split the trailing whitespace after the last "here", since that would violate the idea that it splits from left to right (which is implied in the limit argument and the existence of rsplit, i.e. returning ["spaces here", "and", "here"] is obviously wrong)

Jul 04 '22 09:07 jakobnissen

You said the current behaviour is tested (I was lazy and didn't check), so at least there is some expectation that the behaviour is intended.

Jul 04 '22 09:07 elextr

Hence @jakobnissen 's comment

I can make a PR if it's agreed that current behaviour is a bug, but since current behaviour is explicitly tested for, I guess some people think it's correct behaviour?

I agree with @jakobnissen that this seems like a bug. When I read

Split str into an array of substrings on occurrences of the delimiter(s) dlm...

limit: the maximum size of the result. limit=0 implies no maximum (default)

in the docstring, it implies to me that when a split happens resulting in an empty string, that should be discarded before checking the length of the result rather than after. In other words, the logic should basically be:

split
discard if empty
stop if length(result) >= limit or EOS
repeat

Whereas currently it's

split
stop if length(result) >= limit or EOS
discard if empty
repeat

Jul 05 '22 15:07 kescobo

I don't think this behavior is wrong or unclear currently: the last entry is taken to be the remainder of the string taken after the last (n-1) split, but it might be discarded if keepempty=true. However, since the main reason for that behavior (strtok-like) is probably identical to adding keepempty=true, perhaps this can be changed so that these work differently:

julia> print(split("spaces     ", limit=2, keepempty=false))
SubString{String}["spaces", "    "]
julia> print(split("spaces     ", limit=2, keepempty=true))
SubString{String}["spaces", "    "]

(perhaps the default value for keepempty should be changed from false to limit > 2?)

Jul 07 '22 20:07 vtjnash

Interestingly, coming to think of it, I can see situations where both behaviours would be useful:

When splitting a TSV file with T columns, one can extract the first N-1 columns and the other T-N+1 with the current behaviour. Current behaviour guarantees the remainder has T-N+1 columns, even if these are empty.
When splitting whitespace, it's reasonable to assume that each element is not padded by whitespace. Current behaviour makes it so that it sometimes is.

Maybe it's not a coincidence that Rust has split, split_once, and split_whitespace which behave differently.. I recommend anyone read their behaviour and docs - not that we need to mirror it, but it's quite simple, unambiguous and intuitive.

Jul 08 '22 06:07 jakobnissen

Python split() also handles whitespace specially if no alternate separator is given, and it limits splits, not results, which may be more intuitive. But has no option to drop empty fields.

Jul 08 '22 07:07 elextr

Hello, I believe the default split (not specifying the delimiter) should behave as closely as possible to (am I allowed to say the word here?) Python. So

split("a  b c") == ["a", "b", "c"]
split("a  b c", limit=2) == ["a", "b c"] ## not the case now
split("a  b  c", limit=2) == ["a", "b  c"] ## not the case now

I really think that whitespace-delimited splits (which IMHO should be the default) should be treated separately from character or string delimited splits. If there really is a use case for

split("a  b c") == ["a", "", "b", "c"] ## all spaces matter

then I wouldn't mind if the user has to specify dlm=' ' explicitly. From the manual it appears that keepempty= tries to sort-of implement this, but see the original post.

I recall having a similar discussion about the readdlm() in the Julia 0.3 era, and then I managed to get a PR through with special treatment of the split-by-whitespace case, but now I see this re-appear in base Julia split()...

Aug 08 '22 09:08 davidavdav

(am I allowed to say the word here?) Python.

You are certainly allowed, just be aware that the argument "to be like python" is unlikely to carry weight - if python does "the right thing" and you can argue why, that's different.

I really think that whitespace-delimited splits (which IMHO should be the default) should be treated separately from character or string delimited splits.

While I'm not sure I disagree, I suspect that any argument for inconsistency based on delimiter (at least within the same function) is going to be rejected.

Aug 08 '22 14:08 kescobo