RVerbalExpressions
Character sets
Problem
I think the package will be incomplete until we find a way to express groups of characters. Here's a challenge: express email pattern matching in `rx`:
Challenges
First of all, I don't know of a way to express a single "word" character (`alnum` + `_`). We used `rx_word` to denote `\\w+`, and perhaps it should have been `rx_word_char() %>% rx_one_or_more()`.
rx_char <- function(.data = NULL, value = NULL) {
  if (missing(value))
    return(paste0(.data, "\\w"))
  paste0(.data, sanitize(value))
}
I also extended `rx_count` to cases of ranges of input:
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    n[is.na(n)] <- ""
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  paste0(.data, "{", n, "}")
}
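To make the range behaviour concrete, here is a quick check of the extended `rx_count()` on a few inputs. The function body is copied from above; the input string `"a"` is only an illustrative placeholder, and `NA` is used to mark an open-ended bound.

```r
# Copy of the extended rx_count() from this post (base R only).
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    n[is.na(n)] <- ""  # NA marks an open end of the range
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  paste0(.data, "{", n, "}")
}

exact   <- rx_count("a", 3)         # "a{3}"
bounded <- rx_count("a", 2:6)       # "a{2,6}"
open    <- rx_count("a", c(1, NA))  # "a{1,}"

print(c(exact, bounded, open))
```

Only the first and last elements of `n` are used, so `2:6` and `c(2, 6)` produce the same quantifier.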
We also don't have a way to express word boundaries (`\\b`), and it might be useful to denote them. We shall call this function `rx_word_edge`:
rx_word_start <- function(.data = NULL) {
  paste0(.data, "\\b")
}
rx_word_end <- rx_word_start
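As a rough illustration of what the boundary buys us, the sketch below wraps a literal in two word edges and tests it against a few strings. It uses the proposed name `rx_word_edge`; the sample word and inputs are made up for the example.

```r
# \\b matches the empty string at either edge of a word.
# rx_word_edge is the name proposed in this post, defined here so the
# example runs without the package.
rx_word_edge <- function(.data = NULL) {
  paste0(.data, "\\b")
}

# Build "\\bcat\\b" without a pipe: edge, literal, edge.
pattern <- rx_word_edge(paste0(rx_word_edge(), "cat"))

hits <- grepl(pattern, c("cat", "concat", "a cat sat"), perl = TRUE)
print(hits)  # TRUE FALSE TRUE
```

"concat" is rejected because there is no word edge between "n" and "c".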
Finally, our biggest problem is that there's no way to express groups of characters other than through `rx_any_of()`, but if we pass other `rx` expressions, values will be sanitized twice, meaning that we will get four backslashes before each symbol instead of two.
# this function is exactly like rx_any_of() but without sanitization
rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}
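The double-sanitization effect can be reproduced in isolation. The sketch below uses `escape_once()`, a hypothetical stand-in for the package's `sanitize()` (the real helper may escape a different set of characters), and applies it twice to a single dot.

```r
# Hypothetical stand-in for sanitize(): backslash-escape common regex
# metacharacters. This is an assumption, not the package's implementation.
escape_once <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

once  <- escape_once(".")   # a backslash followed by a dot
twice <- escape_once(once)  # the backslash added above gets escaped too

cat(once, twice, sep = "\n")
```

After the second pass the pattern matches a literal backslash followed by a dot, which is no longer what the caller asked for. That is why already-built `rx` expressions must skip sanitization.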
Solution
Here's what it looks like when we put all pieces together:
x <- rx_word_start() %>%
  rx_group(
    rx() %>%
      rx_char() %>%
      rx_char(".%+-")
  ) %>%
  rx_one_or_more() %>%
  rx_char("@") %>%
  rx_group(
    rx() %>%
      rx_char() %>%
      rx_char(".-")
  ) %>%
  rx_one_or_more() %>%
  rx_char(".") %>%
  rx_alpha() %>%
  rx_count(2:6) %>%
  rx_word_end()
x
#> [1] "\\b[\\w\\.%\\+-]+@[\\w\\.-]+\\.[[:alpha:]]{2,6}\\b"
txt <- "This text contains email [email protected] and [email protected]. The latter is no longer valid."
regmatches(txt, gregexpr(x, txt, perl = TRUE))
#> [[1]]
#> [1] "[email protected]" "[email protected]"
stringr::str_extract_all(txt, x)
#> [[1]]
#> [1] "[email protected]" "[email protected]"
The code works but I don't like it.
- Constructor `rx` looks redundant (I believe there's a way to get rid of it entirely using a specialized class, see below).
- It is not very clear what `rx_one_or_more()` is referring to. I wonder if all functions should have a `rep` argument with default option `one` and options `some`/`any`, in addition to what `rx_count` does today.
- Should `rx_char()` without arguments be called `rx_wordchar`?
- Should `rx_char()` with arguments be called `rx_literal()` or `rx_plain`?
- We should be very explicit about sanitization of arguments. To the extent that we should just mention: "input will be sanitized".
- `rx_group` is an artificial construct, a duplicate of `rx_any_of`, but without sanitization. Here I see a couple of solutions.
  a. Allow "nested pipes" (as I have done above). Create an S3 class and this way detect when the type of the `value` argument is not character, but `rx_string`. Input of this class does not need to be sanitized, because it was sanitized at creation.
  b. Do not allow "nested pipes". Instead, define `rx_any_of()` to have `...` and allow multiple arguments mixing functions and characters. Then a hypothetical pipe would look like this:
rx_word_edge() %>%
  rx_any_of(rx_wordchar(), ".%+-", rep = "some") %>%
  rx_literal("@") %>%
  rx_any_of(rx_wordchar(), ".-", rep = "some") %>%
  rx_literal(".") %>%
  rx_alpha(rep = 2:6) %>%
  rx_word_edge()
It's a lot to digest, but somehow everything is related to one particular problem. Happy to split the issue once we identify the parts worth tackling.
Hi @dmi3kno I'm going to try and summarize just to make sure I understand this all.
Regarding the challenges:
- No function to express individual word characters (`[[:alnum:]]` + `_`). Solution is to add an `rx_word_char() %>% rx_one_or_more()` combo (or add a `rep` argument to everything). The code chunk with `rx_char()`, that's actually referring to `rx_word_char()`, right? It should be, because `rx_char()` sounds misleading if it's `\\w` under the hood (i.e. `!` is a character but won't be matched).
- `rx_count()` accepts ranges, nice addition 👍
- `rx_word_edge()`: we need word boundaries. Same as in point 1, I assume the code chunk just wasn't edited with the name change. Word boundaries will be expressed as `rx_word_edge()`. There is also `\\B` (not word edge), should we support `!rx_word_edge()`?
- The biggest problem is groups of characters: double sanitization becomes an issue if you want to use `rx_` inside an `rx_` call. Solution is to create `rx_group()`, identical to `rx_any_of()` but without sanitization. Side note, `rx_any_of()` might better be represented as `[whatever]*`. Additionally, this function might better be named `rx_one_of()` since it matches one of the characters in the set (i.e. gr[ae]y).
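Two of the points above are easy to check in plain base R, without any package functions: a bracket set matches exactly one character from the set, and `\\B` is the negation of `\\b`. The sample words are made up for the example.

```r
# A character set matches exactly one of the listed characters.
words  <- c("gray", "grey", "groy")
one_of <- grepl("gr[ae]y", words)
print(one_of)  # TRUE TRUE FALSE

# \\B matches only where there is NO word edge, so "cat" must be
# preceded by another word character.
not_edge <- grepl("\\Bcat", c("cat", "concat"), perl = TRUE)
print(not_edge)  # FALSE TRUE
```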
Regarding the solution:
- `rx()` is redundant, but I couldn't get away from needing to pass a value parameter at the start, so `rx` was the quick and dirty solution. So more than happy to find a more elegant solution to this. Is it the S3 class you mention or the `...` or both?
- `rx_one_or_more()` isn't very clear in the nested pipes example. Using the example from your pull request, am I on the right path with this translation:
# old
x <- rx_word_edge() %>%
  rx_alpha() %>%
  rx_one_or_more() %>%
  rx_word_edge()
# new
x <- rx_word_edge() %>%
  rx_alpha(rep = "any") %>%
  rx_word_edge()
- Yes, `rx_char()` (or better yet, `rx_literal()` as you mention) implies things other than word characters, so without an argument it should be `rx_word_char()`, given that this would return `\\w`.
- If `rx_char()` behaves like I think it does in the example, `rx_literal()` sounds most fitting to me. `rx_literal("@")` literally gives you @ and nothing more.
- Letting the user know something is going to be sanitized sounds good, but we might use different words like "special characters will be escaped" or something. I don't know if that's clearer to someone (including myself 😅) without much regex knowledge.
- I do not like nested pipes, I would prefer to avoid that! The second solution looks much cleaner.
With the latest version of `RVerbalExpressions` and some of the functions you wrote, the closest I can get without using the `rep` argument is:
library(RVerbalExpressions)
rx_word_char <- function(.data = NULL, value = NULL) {
  if (missing(value))
    return(paste0(.data, "\\w"))
  paste0(.data, sanitize(value))
}
rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}
rx_any_of <- function(.data = NULL, value, ...) {
  if (missing(...))
    return(paste0(.data, "[", sanitize(value), "]"))
  paste0(.data, "[", value, sanitize(...), "]")
}
rx_literal <- function(.data = NULL, value) {
  paste0(.data, value)
}
x <- rx_word_edge() %>%
  rx_any_of(rx_word_char(), ".%+-") %>%
  rx_one_or_more() %>%
  rx_literal("@") %>%
  rx_any_of(rx_word_char(), ".-") %>%
  rx_one_or_more() %>%
  rx_word_char(".") %>%
  rx_alpha() %>%
  rx_count(n = 2:6) %>%
  rx_word_edge()
txt <- "This text contains email [email protected] and [email protected]. The latter is no longer valid."
stringr::str_extract_all(txt, x)[[1]]
#> [1] "[email protected]" "[email protected]"
Looking at that long pipe makes the `rep` argument worth it to me. This would avoid 3 lines (lines 3, 6, and 9).
Sorry for the messy post. I was writing it and contributing new functions at the same time, so it reflects the evolution of my own thinking. I will be more consistent going forward.
- I really like the idea of supporting (some of) the negated operations, e.g. `!rx_word_edge()`. I have never implemented overloading of `!`. At first look it seems to require a custom class, which is what I believe we should do anyway. But I also can't recall seeing `!` with the pipe. How would that work?
- `rx_one_of()` :+1:. I am on it!
- On your interpretation of the `rep` argument: :100:. `rep` seems like a modifier to me (similar to `rx_count`, so that rep=`any` (quantifier `*`) is `rx_count(c(0, NA))` or `{0,}`, and rep=`some` (quantifier `+`) is the same as `rx_count(c(1, NA))` or `{1,}`). If we go down this route, the `rep` argument could be a receptacle for both wildcard quantifiers and counts, and it is not unthinkable to express:
# the following is equivalent to `[a-zA-Z]*?`
rx_alpha(rep="any", mode="lazy")
- The only thing I feel we should watch out for is that by adding all of these modifying arguments we are on the way back to complexity and away from an intuitive interface. So I say we keep both `rx_one_or_more()` and `rx_none_or_more()` as well as implement a more concise `rep` interface.
- `rx_word_char(.data=NULL)` shall only mean `\\w`. Having said that, I find `rx_word_char` slightly confusing. I actually find the whole `\\w` confusing and remember looking up multiple times what characters are included. Imagine a world where we have `rx_alphanum()` and `rx_alpha_num()`, with the latter also including `_`. (Or maybe `rx_alnum()` and `rx_alnum_()`, although people might confuse the trailing `_` with standard evaluation, thank you `dplyr`).
- `rx_literal()` :+1: I am on it!
- Even though I used `rx_group()` in my initial post, I don't like it. As I said, nested pipes are confusing. So let's go down the route of `...` and parse the content depending on its class. I see you suggested `rx_any_of(.data, value, ...)`, but that's not what I am talking about.
rx_one_of <- function(.data = NULL, ...) {
  args <- sapply(list(...), function(x) if (inherits(x, "rx_string")) x else sanitize(x))
  args_str <- Reduce(paste0, args)
  paste0(.data, "[", args_str, "]")
}
This would require a custom class to be output by every one of our functions:
rx_word_char <- function(.data = NULL) {
  res <- paste0(.data, "\\w")
  class(res) <- unique(c("rx_string", class(res))) # to avoid accidental double "classing"
  res
}
rx_literal <- function(.data = NULL, value) {
  res <- paste0(.data, sanitize(value))
  class(res) <- unique(c("rx_string", class(res))) # to avoid accidental double "classing"
  res
}
But then you can do things like:
rx() %>%
  rx_one_of(rx_word_char(), rx_literal(value = "?"), "abc")
#> [1] "[\\w\\?abc]"
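For anyone who wants to try the idea without the package installed, here is a self-contained sketch of the same pieces. It makes two assumptions: `escape_once()` is a hypothetical stand-in for `sanitize()`, and the small constructor `new_rx()` is an illustrative helper rather than existing package code. Calls are written without the pipe so that only base R is needed.

```r
# Hypothetical stand-in for sanitize(); the real helper may differ.
escape_once <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Illustrative constructor (an assumption): tag a string as already escaped.
new_rx <- function(x) {
  structure(x, class = unique(c("rx_string", class(x))))
}

rx_word_char <- function(.data = NULL) {
  new_rx(paste0(.data, "\\w"))
}

rx_literal <- function(.data = NULL, value) {
  new_rx(paste0(.data, escape_once(value)))
}

rx_one_of <- function(.data = NULL, ...) {
  # rx_string inputs pass through untouched; plain strings are escaped once.
  parts <- vapply(
    list(...),
    function(x) if (inherits(x, "rx_string")) unclass(x) else escape_once(x),
    character(1)
  )
  new_rx(paste0(.data, "[", paste0(parts, collapse = ""), "]"))
}

res <- rx_one_of(NULL, rx_word_char(), rx_literal(value = "?"), "abc")
print(unclass(res))
```

Using `vapply()` with `character(1)` instead of `sapply()` keeps the result type stable even when no arguments are passed through `...`.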
> The only thing I feel we should watch out for, is that by adding all of these modifying arguments we are on the way back to complexity and away from intuitive interface. So I say we keep both rx_one_or_more() and rx_none_or_more() as well as implement more concise rep interface.
100% agree, I would rather have an intuitive API that does less rather than a somewhat clunky API that can do a whole lot. Given the number of functions that have been added, I wonder if a vignette covering common regex use cases and which functions to use would be helpful?
- `!` with the pipe is something I've never seen either, and I only realized that once you mentioned it. If it is possible, it is most likely way beyond me.
- I like `rep`, it should be there, and I agree that `rx_one_or_more` and `rx_none_or_more` stay. I usually don't like the idea of aliases or multiple functions that provide the same functionality, but for the sake of keeping the functions from the original JS repo and having a more verbose option, it is worthwhile to keep them.
- Now that you mention it, `rx_word_char` is a little confusing, it's basically `rx_alnum` + _. I think `rx_alpha_num()` sounds the best. It might not immediately be clear what it does, but if the docs quickly express alphabet + underscore + numbers I think it should be clear and easy to remember.
- The last part using the ellipses:
rx_one_of <- function(.data = NULL, ...) {
  args <- sapply(list(...), function(x) if (inherits(x, "rx_string")) x else sanitize(x))
  args_str <- Reduce(paste0, args)
  paste0(.data, "[", args_str, "]")
}
Looks great, I haven't done much or any programming using ellipses but this looks much more elegant! Very excited about this.
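On the open question of `!`: S3 does allow a method for unary `!` on a custom class, so something like `!rx_word_edge()` is at least technically possible. The sketch below is only one way it could look. The constructor `new_rx()` and the rule "swap a trailing `\\b` for `\\B`" are assumptions made for this example, not package behaviour.

```r
# Illustrative constructor (an assumption for this sketch).
new_rx <- function(x) {
  structure(x, class = "rx_string")
}

rx_word_edge <- function(.data = NULL) {
  new_rx(paste0(.data, "\\b"))
}

# S3 method for unary `!` on rx_string: negate a trailing word edge.
"!.rx_string" <- function(x) {
  s <- unclass(x)
  if (!endsWith(s, "\\b")) {
    stop("this sketch can only negate a trailing word edge")
  }
  new_rx(paste0(substr(s, 1, nchar(s) - 2), "\\B"))
}

negated <- !rx_word_edge()
print(unclass(negated))
```

One caveat with pipes: `%>%` binds more tightly than unary `!`, so `!rx_word_edge() %>% rx_alpha()` is parsed as `!(rx_word_edge() %>% rx_alpha())`. The negation would therefore apply to the result of the whole chain, which is probably not what a reader expects.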
To do here:
- [x] Implement `rx_literal`
- [ ] Implement new class `rx_string`. Make a class constructor `new_rx` (unexported). Implement crucial methods for vectorized class (ref. Hadley)
- [ ] Use rx_literal as example to reimplement other multi-argument functions into method dispatch.
- [ ] Add `rep` argument to most(?) functions. Options (integer vector/"any"/"some").
- [x] Add 'possessive' option to `mode` argument
- [x] `rx_none_of` as inverse of `rx_one_of`. This might be overlapping `rx_anything_but`
- [x] `rx_alnum` renamed to `rx_alphanum`, and `rx_word_char` to `rx_alpha_num`
- [x] Implement `rx_one_of` with `...`. Deprecate `rx_any_of`
- [ ] ~~Reimplement `rx_find` with `...`~~
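For the open `rep` item, a minimal sketch of how `rep` and `mode` could be translated into a quantifier suffix is given below. The helper name `rx_quantifier()` and the `rep`/`mode` arguments on `rx_alpha()` are hypothetical; the mapping follows the discussion above (`some` is `+`, `any` is `*`, integer vectors become counts or ranges, `NA` leaves a bound open).

```r
# Hypothetical helper: turn `rep` and `mode` into a regex quantifier suffix.
rx_quantifier <- function(rep = "one", mode = c("greedy", "lazy", "possessive")) {
  mode <- match.arg(mode)
  if (is.numeric(rep)) {
    # Integer vector: a single count or a range; NA means "open-ended".
    bounds <- ifelse(is.na(rep), "", as.character(rep))
    q <- if (length(rep) > 1) {
      paste0("{", bounds[1], ",", bounds[length(bounds)], "}")
    } else {
      paste0("{", bounds, "}")
    }
  } else {
    q <- switch(rep,
      one = "", some = "+", any = "*",
      stop("unknown rep: ", rep)
    )
  }
  if (q == "") return(q)  # a single occurrence has nothing to modify
  paste0(q, switch(mode, greedy = "", lazy = "?", possessive = "+"))
}

# Hypothetical rx_alpha() with the extra arguments, for illustration only.
rx_alpha <- function(.data = NULL, rep = "one", mode = "greedy") {
  paste0(.data, "[[:alpha:]]", rx_quantifier(rep, mode))
}

print(rx_alpha(rep = "any", mode = "lazy"))  # "[[:alpha:]]*?"
print(rx_alpha(rep = 2:6))                   # "[[:alpha:]]{2,6}"
```

Keeping the translation in one helper would let every `rx_*` function share the same `rep`/`mode` semantics, while `rx_one_or_more()` and `rx_none_or_more()` remain available as the verbose alternative.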