dbplyr
dbplyr copied to clipboard
Support `str_detect(strings, fixed(pattern))` for all DB backends
dbplyr
tries to create a subset of the tidyverse that will work on both database backends, and in-memory tibble backends. Unfortunately there are currently no string-matching functions in that subset, because certain backends (e.g. SQLite) don't support regex by default.
My proposal is therefore to support the "fixed strings" versions of common string matching functions, because (to my knowledge), all SQL backends support the LIKE
operation.
The way I would envisage this working is:
DBI::dbConnect(RSQLite::SQLite(), ":memory:") |>
dplyr::tbl("table_name") |>
dplyr::filter(stringr::str_detect(some_column, fixed("foo")))
This would translate to something like SELECT * FROM table_name WHERE some_column LIKE %foo%
.
However, this would fail, because it requires full regex support:
DBI::dbConnect(RSQLite::SQLite(), ":memory:") |>
dplyr::tbl("table_name") |>
dplyr::filter(stringr::str_detect(some_column, "foo"))
I imagine this could be implemented using an sql_translator
input that provides an str_detect
implementation that fails when the input pattern is anything other than fixed
. In theory this could be extended to support base::grep(fixed=TRUE)
and other similar functions.
This came up when we were implementing some additional translations for Snowflake. The decision was to not translate grepl(..., fixed = TRUE)
to LIKE
, because LIKE
supports SQL wildcards and is therefore not a faithful implementation of fixed pattern matching. On Postgres, these are specifically %
and _
, matching zero or more characters and any single character, respectively. There seems to be some amount of backend dependency here, with MS Access using ?
and *
instead.
As an example, consider
grepl('5%', '50% of the berries', fixed = TRUE)
which would of course evaluate to FALSE
in the base
implementation, but would (perhaps surprisingly to the user) evaluate to TRUE
if translated to LIKE
on many backends.
To properly implement fixed pattern matching, I think we would need to include a bit of per-backend logic escaping any SQL wildcards. After doing a little research just now, I realize there are only two of them, so this wouldn't be terribly complex. I'm happy to submit a PR doing this for Snowflake, Postgres, Redshift, and SQLite if it would be of interest to the authors (@mgirlich ?)
Good points, and I agree that escaping the special characters should ultimately work. Plus, you only need to do this for backends that don't already support regex, so it's not as large a task as it may seem.
I agree this feature would be more valuable to backends without regex support, but think it would still be useful where regex is available; LIKE
is generally better performing due to its low complexity.
Incidentally I'm happy to help with this task if you would like, and if it gets the go ahead.
We already have this translated internally, as well as startsWith()
and endsWith()
. It looks roughly like:
#' Escape specials in input to SQL 'LIKE' clauses.
#' @noRd
# TODO: replace this with a raw string sequence once R>4 is the
# required, this is a multi-engine escape monstrosity.
like_escape <- function(x) gsub("([_%])", "\\\\\\\\\\1", x)
fixed_sentinel <- "<<___FIXED___>>"
strip_fixed_sentinel <- function(x) gsub(paste0("^", fixed_sentinel), "", x)
#' @export
sql_translation.*DBIConnection <- function(con) {
sql_variant(
sql_translator(
.parent = base_scalar,
startsWith = function(x, prefix) {
sql(sprintf(
"%s LIKE %s",
escape(x, con = con),
escape(paste0(like_escape(prefix), "%"), con = con)
))
},
endsWith = function(x, suffix) {
sql(sprintf(
"%s LIKE %s",
escape(x, con = con),
escape(paste0("%", like_escape(suffix)), con = con)
))
},
# stringr equivalents
# fixed comparison option for str_detect et al.
fixed = function(pattern, ignore_case = FALSE) {
if (ignore_case) {
# TODO: enable this as `LOWER(string) LIKE '%pattern%'`
stop("Case-insensitive fixed str_detect() is not yet supported.")
}
sql(paste0(fixed_sentinel, like_escape(pattern)))
},
str_detect = function(string, pattern) {
if (startsWith(pattern, fixed_sentinel)) {
pattern <- strip_fixed_sentinel(pattern)
sql(sprintf(
"%s LIKE '%%%s%%'",
escape(string, con = con),
escape(pattern, con = con)
))
}
# else regex equivalent
},
str_starts = function(string, pattern) {
if (startsWith(pattern, fixed_sentinel)) {
pattern <- strip_fixed_sentinel(pattern)
sql(sprintf(
"%s LIKE '%s%%'",
escape(string, con = con),
escape(pattern, con = con)
))
}
# else regex equivalent
},
str_ends = function(string, pattern) {
if (startsWith(pattern, fixed_sentinel)) {
pattern <- strip_fixed_sentinel(pattern)
sql(sprintf(
"%s LIKE '%%%s'",
escape(string, con = con),
escape(pattern, con = con)
))
}
# else regex equivalent
}
)
}
Happy to contribute this if it looks promising. One complication is that the regex equivalent is implementation-dependent... how will downstreams use the "official" fixed()
translation alongside their custom regex functions?
Thanks for working on this. I'd be happy to see a PR for this 😄
Feel free to assign me... I'll get to it... sometime soon 😄