dbplyr
dbplyr copied to clipboard
Produce shorter SQL for multiple mutates
The generated SQL is way more nested than it would have to be. It would be great if we could shorten the queries
Nested Mutates
This is probably not very common but maybe there is an easy fix.
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
lf <- lazy_frame(x = 1, y = 2)
lf %>%
mutate(
x = x + 1,
x = x + 1,
y = y + 1,
y = y + 1
)
#> <SQL>
#> SELECT `x`, `y` + 1.0 AS `y`
#> FROM (SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
#> FROM (SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`) `q01`) `q02`
# could be
# SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
# FROM (SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
# FROM `df`) `q01`
Group Followed by Multiple Mutates
lf %>%
group_by(x) %>%
mutate(x = x + 1) %>%
mutate(y = y + 1)
#> <SQL>
#> SELECT `x`, `y` + 1.0 AS `y`
#> FROM (SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`) `q01`
# could be
# SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
# FROM `df`
Mutate Followed by Filter
Fixed by #792
lf %>%
mutate(x = x + 1) %>%
filter(y == 1)
#> <SQL>
#> SELECT *
#> FROM (SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`) `q01`
#> WHERE (`y` = 1.0)
# could be
#> <SQL>
#> SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`
#> WHERE (`y` = 1.0)
Joins
Moved to its own issue #722 This feels pretty common to me
your last one is easily solved by (i.e. switching around filter and mutate):
lf %>%
filter(y == 1)%>%
mutate(x = x + 1) %>% show_query()
I'm very interested for optimisation for the group_by clause though, because that
memdb_frame(x = c(1,1,1,1), y = c(2,2,3,3), z=c(1,2,3,4)) %>%
filter(x < 2) %>%
group_by(y)%>%summarise(x1=sum(z))%>%
select(y, x1) %>%
show_query()
produces
<SQL>
SELECT `y`, `x1`
FROM (SELECT `y`, SUM(`z`) AS `x1`
FROM `dbplyr_023`
WHERE (`x` < 2.0)
GROUP BY `y`)
but it really should be:
<SQL>
SELECT `y`, SUM(`z`) AS `x1`
FROM `dbplyr_023`
WHERE (`x` < 2.0)
GROUP BY `y`
This workaround worked for me.
You can first build the list of expressions (that would go into each mutate()
), then apply all at once.
This produces a single "layer" of SELECT ...
.
## First, generate the commands.
commands <- purrr::map2(variable_names, sql_code,
~rlang::exprs(!!..1 := !!dplyr::sql(..2))) |>
unlist(recursive = FALSE)
## Then, apply all at once.
lazy_table <- dplyr::mutate(lazy_table,
!!!commands)