ocaml-re
Re.split odd behaviour with separator at beginning/end
Consider the following toplevel session:
# let rex = Re.(compile (alt [char '\n'; str "\r\n"]));;
val rex : Re.re = <abstr>
# Re.split rex "hello\n\nworld";;
- : string list = ["hello"; ""; "world"]
# Re.split rex "\nhello\n\nworld\n";;
- : string list = ["hello"; ""; "world"]
# Re.split rex "\n\nhello\n\nworld\n\n";;
- : string list = [""; "hello"; ""; "world"; ""]
I understand that Re.split's proper behaviour in this case -- when a separator occurs at the very beginning or at the very end of a string -- is open to discussion. Nevertheless, the currently implemented behaviour as shown above strikes me as odd: if the number of separators at the beginning is 0, 1, and 2, the number of empty elements will be 0, 0, and 1, respectively.
I think it makes more sense for a single separator at the start to produce a list whose first element is empty. Likewise, a single separator at the end should produce an empty last element. In other words, if the number of separators at the beginning is 0, 1, and 2, the number of empty elements should also be 0, 1, and 2, respectively.
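To make the proposed rule concrete, here is a minimal sketch using only the standard library rather than Re. The name split_keep_empty is hypothetical; it hard-codes the same two separators as the regex above ("\n" and "\r\n") and always returns one more field than there are separators, so separators at either end yield empty strings.

```ocaml
(* Hypothetical helper, standard library only: split on "\n" or "\r\n",
   returning (number of separators + 1) fields. Separators at the very
   beginning or end of the input therefore produce empty strings. *)
let split_keep_empty (s : string) : string list =
  let n = String.length s in
  (* [start] is where the current field begins, [i] is the scan position. *)
  let rec go start i acc =
    if i >= n then
      (* End of input: emit the final (possibly empty) field. *)
      List.rev (String.sub s start (n - start) :: acc)
    else if s.[i] = '\n' then
      go (i + 1) (i + 1) (String.sub s start (i - start) :: acc)
    else if s.[i] = '\r' && i + 1 < n && s.[i + 1] = '\n' then
      go (i + 2) (i + 2) (String.sub s start (i - start) :: acc)
    else
      go start (i + 1) acc
  in
  go 0 0 []

let () =
  (* 0, 1 and 2 leading separators give 0, 1 and 2 leading empty fields. *)
  assert (split_keep_empty "hello\n\nworld" = ["hello"; ""; "world"]);
  assert (split_keep_empty "\nhello\n\nworld\n"
          = [""; "hello"; ""; "world"; ""]);
  assert (split_keep_empty "\n\nhello\n\nworld\n\n"
          = [""; ""; "hello"; ""; "world"; ""; ""])
```

This only illustrates the counting rule; it says nothing about how Re.split is implemented internally.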
I've encountered this issue in practice while porting from OCaml-pcre to OCaml-re, and coding around it is a major PITA.
Incidentally, the behaviour of the PCRE library is more useful in practice, though also a bit quirky: for separators at the beginning of a string it behaves like my proposal above; however, any number of separators at the end are simply discarded.
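The PCRE-style behaviour described above can be modelled roughly as "split keeping every empty field, then drop the empty ones at the end". The sketch below does that for a single-character separator with the standard library only. Both function names are hypothetical, and it models the description in this comment, not the actual OCaml-pcre code.

```ocaml
(* Hypothetical helper: remove every empty string from the end of a list
   of fields, leaving leading and interior empty fields untouched. *)
let drop_trailing_empty (fields : string list) : string list =
  let rec drop = function
    | "" :: rest -> drop rest
    | l -> l
  in
  List.rev (drop (List.rev fields))

(* Hypothetical model of the behaviour described above, for a single
   separator character: leading separators produce empty fields, trailing
   separators are discarded. *)
let pcre_like_split (c : char) (s : string) : string list =
  drop_trailing_empty (String.split_on_char c s)

let () =
  assert (pcre_like_split '\n' "\nhello\n\nworld\n"
          = [""; "hello"; ""; "world"]);
  assert (pcre_like_split '\n' "\n\nhello\n\nworld\n\n"
          = [""; ""; "hello"; ""; "world"])
```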
Agreed (having just been stung by this as well). It's surprising that let split_on_char c = Re.(split (compile (char c))) is not equivalent to OCaml's String.split_on_char: besides the examples above, passing "" to Re.split returns an empty list, whereas String.split_on_char returns [""].
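For reference, this is the String.split_on_char behaviour that a regex-based equivalent would have to reproduce. The sketch uses only the standard library. The count_char helper is hypothetical and is there only to state the invariant that the result always has one more element than the input has separators.

```ocaml
(* Hypothetical helper: number of occurrences of [c] in [s]. *)
let count_char (c : char) (s : string) : int =
  String.fold_left (fun n x -> if x = c then n + 1 else n) 0 s

let () =
  (* The empty string yields a single empty field, not an empty list. *)
  assert (String.split_on_char '\n' "" = [""]);
  (* A separator at either end yields an empty field on that side. *)
  assert (String.split_on_char '\n' "\nhello\n" = [""; "hello"; ""]);
  (* Invariant: number of fields = number of separators + 1. *)
  List.iter
    (fun s ->
       assert (List.length (String.split_on_char '\n' s)
               = count_char '\n' s + 1))
    [""; "hello"; "\nhello\n\nworld\n"; "\n\nhello\n\nworld\n\n"]
```

String.fold_left needs OCaml 4.13 or later; on older compilers the count can be written with String.iter and a ref instead.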
#233 adds a Re.split_delim function which behaves as you expect.