dbplyr icon indicating copy to clipboard operation
dbplyr copied to clipboard

Support `str_detect(strings, fixed(pattern))` for all DB backends

Open multimeric opened this issue 1 year ago • 7 comments

dbplyr tries to create a subset of the tidyverse that will work on both database backends, and in-memory tibble backends. Unfortunately there are currently no string-matching functions in that subset, because certain backends (e.g. SQLite) don't support regex by default.

My proposal is therefore to support the "fixed strings" versions of common string matching functions, because (to my knowledge), all SQL backends support the LIKE operation.

The way I would envisage this working is:

DBI::dbConnect(RSQLite::SQLite(), ":memory:") |>
    dplyr::tbl("table_name") |>
    dplyr::filter(stringr::str_detect(some_column, fixed("foo")))

This would translate to something like SELECT * FROM table_name WHERE some_column LIKE %foo%.

However, this would fail, because it requires full regex support:

DBI::dbConnect(RSQLite::SQLite(), ":memory:") |>
    dplyr::tbl("table_name") |>
    dplyr::filter(stringr::str_detect(some_column, "foo"))

I imagine this could be implemented using an sql_translator input that provides an str_detect implementation that fails when the input pattern is anything other than fixed. In theory this could be extended to support base::grep(fixed=TRUE) and other similar functions.

multimeric avatar Sep 16 '22 07:09 multimeric

This came up when we were implementing some additional translations for Snowflake. The decision was to not translate grepl(..., fixed = TRUE) to LIKE, because LIKE supports SQL wildcards and is therefore not a faithful implementation of fixed pattern matching. On Postgres, these are specifically % and _, matching zero or more characters and any single character, respectively. There seems to be some amount of backend dependency here, with MS Access using ? and * instead.

As an example, consider

grepl('5%', '50% of the berries', fixed = TRUE)

which would of course evaluate to FALSE in the base implementation, but would (perhaps surprisingly to the user) evaluate to TRUE if translated to LIKE on many backends.

To properly implement fixed pattern matching, I think we would need to include a bit of per-backend logic escaping any SQL wildcards. After doing a little research just now, I realize there are only two of them, so this wouldn't be terribly complex. I'm happy to submit a PR doing this for Snowflake, Postgres, Redshift, and SQLite if it would be of interest to the authors (@mgirlich ?)

fh-afrachioni avatar Sep 18 '22 16:09 fh-afrachioni

Good points, and I agree that escaping the special characters should ultimately work. Plus, you only need to do this for backends that don't already support regex, so it's not as large a task as it may seem.

multimeric avatar Sep 18 '22 23:09 multimeric

I agree this feature would be more valuable to backends without regex support, but think it would still be useful where regex is available; LIKE is generally better performing due to its low complexity.

fh-afrachioni avatar Sep 18 '22 23:09 fh-afrachioni

Incidentally I'm happy to help with this task if you would like, and if it gets the go ahead.

multimeric avatar Sep 19 '22 01:09 multimeric

We already have this translated internally, as well as startsWith() and endsWith(). It looks roughly like:

#' Escape specials in input to SQL 'LIKE' clauses.
#' @noRd
# TODO: replace this with a raw string sequence once R>4 is the
#   required, this is a multi-engine escape monstrosity.
like_escape <- function(x) gsub("([_%])", "\\\\\\\\\\1", x)

fixed_sentinel <- "<<___FIXED___>>"
strip_fixed_sentinel <- function(x) gsub(paste0("^", fixed_sentinel), "", x)

#' @export
sql_translation.*DBIConnection <- function(con) {
  sql_variant(
    sql_translator(
      .parent = base_scalar,
      startsWith = function(x, prefix) {
        sql(sprintf(
          "%s LIKE %s",
          escape(x, con = con),
          escape(paste0(like_escape(prefix), "%"), con = con)
        ))
      },
      endsWith = function(x, suffix) {
        sql(sprintf(
          "%s LIKE %s",
          escape(x, con = con),
          escape(paste0("%", like_escape(suffix)), con = con)
        ))
      },

      # stringr equivalents
      # fixed comparison option for str_detect et al.
      fixed = function(pattern, ignore_case = FALSE) {
        if (ignore_case) {
          # TODO: enable this as `LOWER(string) LIKE '%pattern%'`
          stop("Case-insensitive fixed str_detect() is not yet supported.")
        }
        sql(paste0(fixed_sentinel, like_escape(pattern)))
      },
      str_detect = function(string, pattern) {
        if (startsWith(pattern, fixed_sentinel)) {
          pattern <- strip_fixed_sentinel(pattern)
          sql(sprintf(
            "%s LIKE '%%%s%%'",
            escape(string, con = con),
            escape(pattern, con = con)
          ))
        }
        # else regex equivalent
      },
      str_starts = function(string, pattern) {
        if (startsWith(pattern, fixed_sentinel)) {
          pattern <- strip_fixed_sentinel(pattern)
          sql(sprintf(
            "%s LIKE '%s%%'",
            escape(string, con = con),
            escape(pattern, con = con)
          ))
        }
        # else regex equivalent
      },
      str_ends = function(string, pattern) {
        if (startsWith(pattern, fixed_sentinel)) {
          pattern <- strip_fixed_sentinel(pattern)
          sql(sprintf(
            "%s LIKE '%%%s'",
            escape(string, con = con),
            escape(pattern, con = con)
          ))
        }
        # else regex equivalent
      }
  )
}

Happy to contribute this if it looks promising. One complication is that the regex equivalent is implementation-dependent... how will downstreams use the "official" fixed() translation alongside their custom regex functions?

MichaelChirico avatar Sep 20 '22 07:09 MichaelChirico

Thanks for working on this. I'd be happy to see a PR for this 😄

mgirlich avatar Sep 20 '22 10:09 mgirlich

Feel free to assign me... I'll get to it... sometime soon 😄

MichaelChirico avatar Sep 25 '22 04:09 MichaelChirico