RClickhouse icon indicating copy to clipboard operation
RClickhouse copied to clipboard

Semi join

Open lucasxteixeira opened this issue 1 year ago • 1 comments

Hi,

I recently encountered an issue while using dplyr::semi_join with Clickhouse. The default code generated by dplyr produces a subquery with dependencies, and this isn't supported in Clickhouse (or am I wrong?). However, I noticed that Clickhouse does support LEFT SEMI JOIN. Consequently, I've wrote the following function to address this:

#' @export
#' @importFrom dbplyr sql_query_semi_join
sql_query_semi_join.ClickhouseConnection <- function(con, x, y, anti, by, where, vars, ..., lvl = 0) {

  x <- dbplyr:::dbplyr_sql_subquery(con, x, name = by$x_as, lvl = lvl)
  y <- dbplyr:::dbplyr_sql_subquery(con, y, name = by$y_as, lvl = lvl)

  on <- dbplyr:::sql_join_tbls(con, by)

  JOIN <- ifelse(anti, dplyr::sql("ANTI LEFT JOIN"), dplyr::sql("SEMI LEFT JOIN"))

  # Wrap with SELECT since callers assume a valid query is returned
  clauses <- list(
    dbplyr:::sql_clause_select(con, vars),
    dbplyr:::sql_clause_from(x),
    dbplyr:::sql_clause(JOIN, y),
    dbplyr:::sql_clause("ON", on, sep = " AND", parens = TRUE, lvl = 1)
  )
  dbplyr:::sql_format_clauses(clauses, lvl, con)
}

Nonetheless, I'm aware that my function uses some internal dbplyr functions, and I'm uncertain about the permissibility of this approach. Could someone provide some directions on how to refine this function for a potential PR?

Thank you in advance.

lucasxteixeira avatar Aug 06 '23 11:08 lucasxteixeira

Hi Lucas! We appreciate your help! The package comprises two parts: (1) the Clickhouse-Driver and (2) the dbplyr interface. Your code does dplyr-code generation and, thus, targets the second part. We started to develop the package at a time when Clickhouse's Join-support was super basic, not in line with the standards, and highly restricted, but that changed a lot. Consequently, you can do a much better job than we did.

I am thrilled that you want to invest time here, and a pull request is welcome. Ideally, you would adapt the R/dplyr.R file, as this is where the dplyr-specific code is located. But here are a few more settings to keep in mind:

  • The code generation should be in line with the settings. Not all join algorithms support all join clauses.
  • Moreover, the function should also consider ANY/ALL (just in case someone changed join_default_strictness).
  • Please keep in mind that also any_join_distinct_right_table_keys is likely to affect the semantics.
  • Also test your code with the new planner/analyzer (allow_experimental_analyzer). ClickHouse#42648

inkrement avatar Aug 14 '23 06:08 inkrement