dbplyr icon indicating copy to clipboard operation
dbplyr copied to clipboard

Produce shorter SQL for multiple mutates

Open mgirlich opened this issue 2 years ago • 2 comments

The generated SQL is way more nested than it would have to be. It would be great if we could shorten the queries

Nested Mutates

This is probably not very common but maybe there is an easy fix.

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

lf <- lazy_frame(x = 1, y = 2)
lf %>% 
  mutate(
    x = x + 1,
    x = x + 1,
    y = y + 1,
    y = y + 1
  )
#> <SQL>
#> SELECT `x`, `y` + 1.0 AS `y`
#> FROM (SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
#> FROM (SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`) `q01`) `q02`

# could be
# SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
# FROM (SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
# FROM `df`) `q01`

Group Followed by Multiple Mutates

lf %>% 
  group_by(x) %>% 
  mutate(x = x + 1) %>% 
  mutate(y = y + 1)
#> <SQL>
#> SELECT `x`, `y` + 1.0 AS `y`
#> FROM (SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`) `q01`

# could be
# SELECT `x` + 1.0 AS `x`, `y` + 1.0 AS `y`
# FROM `df`

Mutate Followed by Filter

Fixed by #792

lf %>% 
  mutate(x = x + 1) %>% 
  filter(y == 1)
#> <SQL>
#> SELECT *
#> FROM (SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`) `q01`
#> WHERE (`y` = 1.0)

# could be
#> <SQL>
#> SELECT `x` + 1.0 AS `x`, `y`
#> FROM `df`
#> WHERE (`y` = 1.0)

Joins

Moved to its own issue #722 This feels pretty common to me

mgirlich avatar Nov 04 '21 10:11 mgirlich

your last one is easily solved by (i.e. switching around filter and mutate):

lf %>%  
   filter(y == 1)%>%
   mutate(x = x + 1) %>% show_query()

I'm very interested for optimisation for the group_by clause though, because that

   memdb_frame(x = c(1,1,1,1), y = c(2,2,3,3), z=c(1,2,3,4)) %>% 
   filter(x < 2) %>% 
   group_by(y)%>%summarise(x1=sum(z))%>%
   select(y, x1) %>% 
   show_query()

produces

  <SQL>
 SELECT `y`, `x1`
 FROM (SELECT `y`, SUM(`z`) AS `x1`
 FROM `dbplyr_023`
 WHERE (`x` < 2.0)
 GROUP BY `y`)

but it really should be:

 <SQL>
 SELECT `y`, SUM(`z`) AS `x1`
 FROM `dbplyr_023`
 WHERE (`x` < 2.0)
 GROUP BY `y`

andia89 avatar Nov 05 '21 10:11 andia89

This workaround worked for me.

You can first build the list of expressions (that would go into each mutate()), then apply all at once.

This produces a single "layer" of SELECT ....

## First, generate the commands.
commands <- purrr::map2(variable_names, sql_code,
  ~rlang::exprs(!!..1 := !!dplyr::sql(..2))) |>
  unlist(recursive = FALSE)

## Then, apply all at once.
lazy_table <- dplyr::mutate(lazy_table,
  !!!commands)

fabkury avatar Jan 14 '23 03:01 fabkury