dbplyr Produce shorter SQL for multiple mutates

Produce shorter SQL for multiple mutates

Open mgirlich opened this issue 2 years ago • 2 comments

The generated SQL is way more nested than it would have to be. It would be great if we could shorten the queries

Nested Mutates

This is probably not very common but maybe there is an easy fix.

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

lf <- lazy_frame(x = 1, y = 2)
lf %>% 
  mutate(
    x = x + 1,
    x = x + 1,
    y = y + 1,
    y = y + 1
  )
#> <SQL>
#> SELECT `x`, `y` + 1.0 AS `y`
#> FROM (SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
#> FROM (SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`) `q01`) `q02`

# could be
# SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
# FROM (SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
# FROM `df`) `q01`

Group Followed by Multiple Mutates

lf %>% 
  group_by(x) %>% 
  mutate(x = x + 1) %>% 
  mutate(y = y + 1)
#> <SQL>
#> SELECT `x`, `y` + 1.0 AS `y`
#> FROM (SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`) `q01`

# could be
# SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
# FROM `df`

Mutate Followed by Filter

Fixed by #792

lf %>% 
  mutate(x = x + 1) %>% 
  filter(y == 1)
#> <SQL>
#> SELECT *
#> FROM (SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`) `q01`
#> WHERE (`y` = 1.0)

# could be
#> <SQL>
#> SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`
#> WHERE (`y` = 1.0)

Joins

Moved to its own issue #722 This feels pretty common to me

Nov 04 '21 10:11 mgirlich

your last one is easily solved by (i.e. switching around filter and mutate):

lf %>%  
   filter(y == 1)%>%
   mutate(x = x + 1) %>% show_query()

I'm very interested for optimisation for the group_by clause though, because that

   memdb_frame(x = c(1,1,1,1), y = c(2,2,3,3), z=c(1,2,3,4)) %>% 
   filter(x < 2) %>% 
   group_by(y)%>%summarise(x1=sum(z))%>%
   select(y, x1) %>% 
   show_query()

produces

  <SQL>
 SELECT `y`, `x1`
 FROM (SELECT `y`, SUM(`z`) AS `x1`
 FROM `dbplyr_023`
 WHERE (`x` < 2.0)
 GROUP BY `y`)

but it really should be:

 <SQL>
 SELECT `y`, SUM(`z`) AS `x1`
 FROM `dbplyr_023`
 WHERE (`x` < 2.0)
 GROUP BY `y`

Nov 05 '21 10:11 andia89

This workaround worked for me.

You can first build the list of expressions (that would go into each mutate()), then apply all at once.

This produces a single "layer" of SELECT ....

## First, generate the commands.
commands <- purrr::map2(variable_names, sql_code,
  ~rlang::exprs(!!..1 := !!dplyr::sql(..2))) |>
  unlist(recursive = FALSE)

## Then, apply all at once.
lazy_table <- dplyr::mutate(lazy_table,
  !!!commands)

Jan 14 '23 03:01 fabkury

dbplyr dbplyr copied to clipboard

Produce shorter SQL for multiple mutates

Nested Mutates

Group Followed by Multiple Mutates

Mutate Followed by Filter

Joins

dbplyr
dbplyr copied to clipboard