RClickhouse
RClickhouse copied to clipboard
Semi join
Hi,
I recently encountered an issue while using dplyr::semi_join with Clickhouse. The default code generated by dplyr produces a subquery with dependencies, and this isn't supported in Clickhouse (or am I wrong?). However, I noticed that Clickhouse does support LEFT SEMI JOIN. Consequently, I've wrote the following function to address this:
#' @export
#' @importFrom dbplyr sql_query_semi_join
sql_query_semi_join.ClickhouseConnection <- function(con, x, y, anti, by, where, vars, ..., lvl = 0) {
x <- dbplyr:::dbplyr_sql_subquery(con, x, name = by$x_as, lvl = lvl)
y <- dbplyr:::dbplyr_sql_subquery(con, y, name = by$y_as, lvl = lvl)
on <- dbplyr:::sql_join_tbls(con, by)
JOIN <- ifelse(anti, dplyr::sql("ANTI LEFT JOIN"), dplyr::sql("SEMI LEFT JOIN"))
# Wrap with SELECT since callers assume a valid query is returned
clauses <- list(
dbplyr:::sql_clause_select(con, vars),
dbplyr:::sql_clause_from(x),
dbplyr:::sql_clause(JOIN, y),
dbplyr:::sql_clause("ON", on, sep = " AND", parens = TRUE, lvl = 1)
)
dbplyr:::sql_format_clauses(clauses, lvl, con)
}
Nonetheless, I'm aware that my function uses some internal dbplyr functions, and I'm uncertain about the permissibility of this approach. Could someone provide some directions on how to refine this function for a potential PR?
Thank you in advance.
Hi Lucas! We appreciate your help! The package comprises two parts: (1) the Clickhouse-Driver and (2) the dbplyr interface. Your code does dplyr-code generation and, thus, targets the second part. We started to develop the package at a time when Clickhouse's Join-support was super basic, not in line with the standards, and highly restricted, but that changed a lot. Consequently, you can do a much better job than we did.
I am thrilled that you want to invest time here, and a pull request is welcome. Ideally, you would adapt the R/dplyr.R file, as this is where the dplyr-specific code is located. But here are a few more settings to keep in mind:
- The code generation should be in line with the settings. Not all join algorithms support all join clauses.
- Moreover, the function should also consider ANY/ALL (just in case someone changed join_default_strictness).
- Please keep in mind that also any_join_distinct_right_table_keys is likely to affect the semantics.
- Also test your code with the new planner/analyzer (allow_experimental_analyzer). ClickHouse#42648