split string doesn't allow regular expression

Open mdbergmann opened this issue 3 years ago • 7 comments

See:

CL-USER> (str:split "\\n\\n" (str:join "" '("foo" #\newline #\newline "bar")))
("foo

bar")
CL-USER> (ppcre:split "\\n\\n" (str:join "" '("foo" #\newline #\newline "bar")))
("foo" "bar")
CL-USER> (ppcre:split "\\s\\s" (str:join "" '("foo" #\newline #\newline "bar")))
("foo" "bar")

Since ppcre is also used under the hood, it would be great to support splitting by regex. Maybe a separate function, re-split?

mdbergmann, Dec 23 '20 08:12

Hello, indeed, and that is a feature. str:split explicitly quotes meta characters to not allow regexps. This should be made explicit in the documentation and the docstring.

And indeed, we can use ppcre:split for that (and we always can, because ppcre is a dependency). At first sight, I find that adding re-split would not add much value and is not worth duplicating a function. Enhancing the README and the docstring to refer to ppcre would have been enough in your case?

vindarel, Dec 23 '20 10:12
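The difference between a regex separator and a literal one can be sketched directly with cl-ppcre. This is only an illustration, not a copy of how str is implemented: it assumes Quicklisp is available for loading cl-ppcre, and the variable names are made up for the example. It shows two ways to make ppcre treat a separator literally: `ppcre:quote-meta-chars` and a `(:sequence ...)` parse tree.

```lisp
;; Assumption: Quicklisp is installed and can load cl-ppcre.
(ql:quickload :cl-ppcre)

;; "foo", two newline characters, "bar".
(defparameter *text* (format nil "foo~%~%bar"))

;; Interpreted as a regex, \n\n matches the two newline characters.
(defparameter *regex-result*
  (ppcre:split "\\n\\n" *text*))

;; With meta characters quoted, the separator only matches a literal
;; backslash-n-backslash-n, which does not occur in *text*.
(defparameter *quoted-result*
  (ppcre:split (ppcre:quote-meta-chars "\\n\\n") *text*))

;; A parse tree with a string inside :sequence is also matched literally.
(defparameter *literal-result*
  (ppcre:split '(:sequence "\\n\\n") *text*))
```

In this sketch only `*regex-result*` is split into two parts; the two literal variants return the input as a single element.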

> explicitly quotes meta characters to not allow regexps

Yeah. I've seen that. Had a glimpse at the sources.

> Enhancing the README and the docstring to refer to ppcre would have been enough in your case?

Well. I guess it has to if you don't want to add it. I find it unfortunate however to fall back to ppcre directly to perform a split of a string, which forces me to mix namespaces of 'str' and 'ppcre' when only 'str' would suffice.

From an API perspective this could be controlled via key parameters, including the rsplit to just use split, for instance:

(split "o" "foo" :reverse t)   ;; instead of `rsplit`

(split "o{2}" "foor" :regex t)

Manfred

mdbergmann, Dec 23 '20 10:12

> if you don't want to add it.

I don't close the possibility.

> I find it unfortunate however to fall back to ppcre directly to perform a split of a string, which forces me to mix namespaces of 'str' and 'ppcre' when only 'str' would suffice.

Yeah, I understand this too. But:

  • when we think "regexp", it might be best to turn to ppcre.
  • say we add re-split, then what if we want to, say, extract substrings matching a regexp? Or call starts-with-p but with a regexp? etc. Are they valid use cases for this string library, or light copies of ppcre functionality?

> this could be controlled via key parameters

yes +1, we do this for some functions but it could be generalized.

vindarel, Dec 23 '20 13:12

> when we think "regexp", it might be best to turn to ppcre.

Regex is just a representation of an arbitrary string. The most flexible way to represent a string. Regex is not necessarily bound to ppcre. It just happens to be that ppcre is the library that 'understands' them. However, ppcre is much more low-level than str is.

I don't care so much whether it is a regex to use for splitting as long as I can use an arbitrary string. (Insofar I would probably refrain from re-split, but just have a split). I.e. splitting a text file with Windows line endings I have to use this workaround.

(str:split (str:join "" '(#\return #\newline)) 
           (str:join "" '("foo" #\return #\newline "bar")))

I see string splitting as essential for string parsing. It is a kind of lightweight alternative to capturing (which really is about regexes), but in order to be usable for parsing it must allow arbitrary strings for splitting.

mdbergmann, Dec 23 '20 14:12
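For the Windows line ending case, a separator containing special characters can also be built without any regex and without str:join, using only standard Common Lisp. The following is a small sketch under the assumption that Quicklisp is available to load str; the variable names are illustrative only.

```lisp
;; Assumption: Quicklisp is installed and can load the str library.
(ql:quickload :str)

;; A two-character string holding carriage return and line feed.
(defparameter *crlf* (coerce '(#\Return #\Newline) 'string))

;; Sample input: "foo", CR, LF, "bar".
(defparameter *windows-text*
  (format nil "foo~C~Cbar" #\Return #\Newline))

;; str:split treats the separator literally, so no escaping is needed.
(defparameter *lines* (str:split *crlf* *windows-text*))
;; *lines* => ("foo" "bar")

;; The same idea covers a tab prefix check: STRING turns a character
;; into a one-character string.
(defparameter *starts-with-tab*
  (str:starts-with-p (string #\Tab) (format nil "~Cindented" #\Tab)))
```

The point of the sketch is that characters such as `#\Return` or `#\Tab` can be placed in an ordinary string, so regex escapes like `\r\n` or `\t` are a convenience rather than a requirement.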

> Or call starts-with-p but with a regexp?

That's a valid point. What about other functionalities like 'starts-with' or 'ends-with'? My take is that those are much less dependent on regular expressions than splitting is. Though it might still be necessary to supply a tab character to a 'starts-with' function. I'm not sure if there is any other way of encoding special characters in a string so that it can be applied in 'starts-with' or 'split' without using a regex.

mdbergmann, Dec 23 '20 16:12

Thanks for detailing your use case and motivation.

> (Insofar I would probably refrain from re-split, but just have a split).

split with a :regex (:re? both?) key would be good for you? That looks good, we should do it.

> I.e. splitting a text file with Windows line endings I have to use this workaround.

Here str should probably help and provide specific variables or function parameters. So you would not look for a regexp, but use a built-in explicitly.

> Though it might still be necessary to supply a tab character to a 'starts-with' function.

+1, we should be able to give a character to starts-with-p, as with other functions.

vindarel, Jan 05 '21 12:01

Hi.

> split with a :regex (:re? both?) key would be good for you?

I would choose :regex

mdbergmann, Jan 05 '21 13:01

So then str:split simply needs the :regex keyword parameter and an if clause like this?

(if regex
    (ppcre:split separator s :limit limit :start start :end end)
    (ppcre:split `(:sequence ,(string separator))
                 s
                 :limit limit :start start :end end))

Or do we need more adjustments, such as support for the other ppcre:split parameters (with-registers-p, omit-unmatched-p, sharedp), or something else? @mdbergmann @vindarel

kilianmh, Mar 21 '23 23:03
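The if clause discussed here can be wrapped into a complete function to try the idea out. This is a minimal sketch, not the code that ended up in str: the name `split-sketch` is hypothetical, Quicklisp is assumed for loading cl-ppcre, and `start` and `end` are given explicit defaults so that they are always valid bounds for `ppcre:split`.

```lisp
;; Assumption: Quicklisp is installed and can load cl-ppcre.
(ql:quickload :cl-ppcre)

(defun split-sketch (separator s &key regex limit (start 0) (end (length s)))
  "Hypothetical helper: split S by SEPARATOR.
When REGEX is true, SEPARATOR is handed to ppcre as a regular expression.
Otherwise it is wrapped in a (:sequence ...) parse tree, so that it is
matched literally, meta characters included."
  (ppcre:split (if regex
                   separator
                   `(:sequence ,(string separator)))
               s
               :limit limit :start start :end end))

;; Literal split: the dot is not a wildcard here.
(split-sketch "." "a.b.c")                        ; => ("a" "b" "c")

;; Regex split: "o{2}" matches the two o's.
(split-sketch "o{2}" "foor" :regex t)             ; => ("f" "r")
```

Because `string` also accepts a character, a call such as `(split-sketch #\, "a,b")` works in the literal branch. The other `ppcre:split` parameters mentioned in the question are left out of this sketch.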

split, rsplit and split-omit-nulls with a :regex key argument are probably useful, although I didn't encounter the need.

An example I can think of:

(str:split "[0-9]+" "some987stupid123string" :regex t) ;; => ("some" "stupid" "string")

vindarel, Apr 05 '23 17:04

I had the same idea about improving str:split today. Instead of using a regex, I think the separator could be a list that contains all the separators, like (str:split '(";" "," " ") "some;thing, stupid ")

But it looks like regex is the more general way to improve it. I am happy with adding the :regex keyword.

Update: I opened a PR for regex split: https://github.com/vindarel/cl-str/pull/110

ccqpein, Dec 02 '23 19:12
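The list-of-separators idea can be expressed with a cl-ppcre parse tree, which illustrates why a regex is the more general tool. The sketch below is an assumption-laden illustration: `split-on-any` is a hypothetical name, it is not part of str, it expects two or more separators, and Quicklisp is assumed for loading cl-ppcre.

```lisp
;; Assumption: Quicklisp is installed and can load cl-ppcre.
(ql:quickload :cl-ppcre)

(defun split-on-any (separators s)
  "Hypothetical helper: split S at any of the literal SEPARATORS.
Builds an (:alternation ...) parse tree; strings inside a parse tree
are matched literally, so no escaping is needed."
  (ppcre:split `(:alternation ,@(mapcar #'string separators)) s))

;; ppcre:split keeps the empty string between "," and " " and drops
;; the trailing empty string.
(split-on-any '(";" "," " ") "some;thing, stupid ")
;; => ("some" "thing" "" "stupid")

;; Dropping empty pieces afterwards gives the three words.
(remove "" (split-on-any '(";" "," " ") "some;thing, stupid ")
        :test #'string=)
;; => ("some" "thing" "stupid")
```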

thanks for doing it!

May it serve you well for advent of code ;)

vindarel, Dec 07 '23 18:12

Ah! @vindarel, it is you who made this tool. I thought the ID looked familiar!

ccqpein, Dec 07 '23 19:12