dbplyr icon indicating copy to clipboard operation
dbplyr copied to clipboard

Inability to remove `window_order()` leads to weird, unpredictable results.

Open iangow opened this issue 3 years ago • 14 comments

There is no way to remove window_order() from a remote data frame leading to weird, unpredictable results.

The only difference between the code generating method_1 and that generating method_2 is a single compute(). Yet one has ORDER BY "year" ROWS UNBOUNDED PRECEDING (method_1) where the other (method_2) has nothing.

  1. Some times we want the behaviour of method_2, but it is not clear how to get this. It is not always possible to do compute() (big part of the rationale for copy_inline()).
  2. Even if it is possible to compute(), it is not clear why the compute() should yield radically different results.

I guess ideally ungroup() should remove all these artefacts. If not, perhaps some other way to de-window_order() a remote data frame seems needed.

This seems related to #345 and #733. #345 was closed without solution for lack of a reprex. I have provided one below.

Brief description of the problem

library(DBI)
library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

set.seed(2024)
db <- dbConnect(duckdb::duckdb())

df <-
  expand_grid(id = letters[1:4], year = 1997:2001) %>%
  mutate(value = sample(0:5, size = nrow(.), replace = TRUE)) %>%
  copy_to(db, ., "df")

df_g <-
  df %>%
  group_by(id) %>%
  window_order(year) %>%
  window_frame(-Inf, 0) %>%
  mutate(cumsum = sum(value, na.rm = TRUE)) %>%
  ungroup()

df_g
#> # Source:     SQL [?? x 4]
#> # Database:   DuckDB 0.7.1 [root@Darwin 22.4.0:R 4.2.2/:memory:]
#> # Ordered by: year
#>    id     year value cumsum
#>    <chr> <int> <int>  <dbl>
#>  1 a      1997     1      1
#>  2 a      1998     4      5
#>  3 a      1999     4      9
#>  4 a      2000     3     12
#>  5 a      2001     0     12
#>  6 d      1997     1      1
#>  7 d      1998     1      2
#>  8 d      1999     3      5
#>  9 d      2000     1      6
#> 10 d      2001     1      7
#> # ℹ more rows

method_1 <-
  df_g %>%
  group_by(id) %>%
  mutate(cumprop = 1.0 * cumsum/sum(value, na.rm = TRUE)) %>%
  ungroup()

method_1 %>%
  arrange(id, year)
#> # Source:     SQL [?? x 5]
#> # Database:   DuckDB 0.7.1 [root@Darwin 22.4.0:R 4.2.2/:memory:]
#> # Ordered by: id, year
#>    id     year value cumsum cumprop
#>    <chr> <int> <int>  <dbl>   <dbl>
#>  1 a      1997     1      1       1
#>  2 a      1998     4      5       1
#>  3 a      1999     4      9       1
#>  4 a      2000     3     12       1
#>  5 a      2001     0     12       1
#>  6 b      1997     4      4       1
#>  7 b      1998     0      4       1
#>  8 b      1999     1      5       1
#>  9 b      2000     1      6       1
#> 10 b      2001     4     10       1
#> # ℹ more rows

method_1 %>%
  show_query()
#> <SQL>
#> SELECT
#>   *,
#>   (1.0 * cumsum) / SUM("value") OVER (PARTITION BY id ORDER BY "year" ROWS UNBOUNDED PRECEDING) AS cumprop
#> FROM (
#>   SELECT
#>     *,
#>     SUM("value") OVER (PARTITION BY id ORDER BY "year" ROWS UNBOUNDED PRECEDING) AS cumsum
#>   FROM df
#> ) q01

method_2 <-
  df_g %>%
  compute() %>%
  group_by(id) %>%
  mutate(cumprop = 1.0 * cumsum/sum(value, na.rm = TRUE)) %>%
  ungroup()

method_2 %>%
  arrange(id, year)
#> # Source:     SQL [?? x 5]
#> # Database:   DuckDB 0.7.1 [root@Darwin 22.4.0:R 4.2.2/:memory:]
#> # Ordered by: id, year
#>    id     year value cumsum cumprop
#>    <chr> <int> <int>  <dbl>   <dbl>
#>  1 a      1997     1      1  0.0833
#>  2 a      1998     4      5  0.417 
#>  3 a      1999     4      9  0.75  
#>  4 a      2000     3     12  1     
#>  5 a      2001     0     12  1     
#>  6 b      1997     4      4  0.4   
#>  7 b      1998     0      4  0.4   
#>  8 b      1999     1      5  0.5   
#>  9 b      2000     1      6  0.6   
#> 10 b      2001     4     10  1     
#> # ℹ more rows

method_2 %>%
  show_query()
#> <SQL>
#> SELECT *, (1.0 * cumsum) / SUM("value") OVER (PARTITION BY id) AS cumprop
#> FROM dbplyr_001

Created on 2023-04-18 with reprex v2.0.2

iangow avatar Apr 16 '23 09:04 iangow

It seems to me that the group_by(year) should remove the effects any previous window_order().

iangow avatar Apr 16 '23 10:04 iangow

You can simply remove the window order via df_g |> window_order().

It seems to me that the group_by(year) should remove the effects any previous window_order().

I'm not sure that group_by() should remove the window order. @hadley Do you have an opinion on this?

mgirlich avatar Apr 18 '23 08:04 mgirlich

You can simply remove the window order via df_g |> window_order().

It seems to me that the group_by(year) should remove the effects any previous window_order().

I'm not sure that group_by() should remove the window order. @hadley Do you have an opinion on this?

I tried that (twice! … see updated example above), but effects of window_order() persist (in method_1) and the only way I can remove them seems to be to use compute() (method_2).

iangow avatar Apr 18 '23 10:04 iangow

And using RPostgres::Postgres() causes problems with method_1 (an error message), though method_2 is fine.

iangow avatar Apr 18 '23 10:04 iangow

OK. It's window_frame() that I needed to call!

I call window_frame() below to get method_1 to produce the results of method_2 (differs only by a compute()). No need for window_order(). Though surely it makes sense to remove window_frame() when calling window_order() to remove the window order (not clear what window_frame() means without a window order).

Perhaps this is a documentation issue. "To remove window order and frame, call window_order()."

library(DBI)
library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

set.seed(2024)
db <- dbConnect(duckdb::duckdb())

df <-
  expand_grid(id = letters[1:4], year = 1997:2001) %>%
  mutate(value = sample(0:5, size = nrow(.), replace = TRUE)) %>%
  copy_to(db, ., "df")

df_g <-
  df %>%
  group_by(id) %>%
  window_order(year) %>%
  window_frame(-Inf, 0) %>%
  mutate(cumsum = sum(value, na.rm = TRUE)) %>%
  ungroup() %>%
  window_frame()

df_g
#> # Source:     SQL [?? x 4]
#> # Database:   DuckDB 0.7.1 [root@Darwin 22.4.0:R 4.2.2/:memory:]
#> # Ordered by: year
#>    id     year value cumsum
#>    <chr> <int> <int>  <dbl>
#>  1 a      1997     1      1
#>  2 a      1998     4      5
#>  3 a      1999     4      9
#>  4 a      2000     3     12
#>  5 a      2001     0     12
#>  6 d      1997     1      1
#>  7 d      1998     1      2
#>  8 d      1999     3      5
#>  9 d      2000     1      6
#> 10 d      2001     1      7
#> # ℹ more rows

method_1 <-
  df_g %>%
  group_by(id) %>%
  mutate(cumprop = 1.0 * cumsum/sum(value, na.rm = TRUE)) %>%
  ungroup()

method_1 %>%
  arrange(id, year)
#> # Source:     SQL [?? x 5]
#> # Database:   DuckDB 0.7.1 [root@Darwin 22.4.0:R 4.2.2/:memory:]
#> # Ordered by: id, year
#>    id     year value cumsum cumprop
#>    <chr> <int> <int>  <dbl>   <dbl>
#>  1 a      1997     1      1  0.0833
#>  2 a      1998     4      5  0.417 
#>  3 a      1999     4      9  0.75  
#>  4 a      2000     3     12  1     
#>  5 a      2001     0     12  1     
#>  6 b      1997     4      4  0.4   
#>  7 b      1998     0      4  0.4   
#>  8 b      1999     1      5  0.5   
#>  9 b      2000     1      6  0.6   
#> 10 b      2001     4     10  1     
#> # ℹ more rows

method_1 %>%
  show_query()
#> <SQL>
#> SELECT
#>   *,
#>   (1.0 * cumsum) / SUM("value") OVER (PARTITION BY id ORDER BY "year" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS cumprop
#> FROM (
#>   SELECT
#>     *,
#>     SUM("value") OVER (PARTITION BY id ORDER BY "year" ROWS UNBOUNDED PRECEDING) AS cumsum
#>   FROM df
#> ) q01

method_2 <-
  df_g %>%
  compute() %>%
  group_by(id) %>%
  mutate(cumprop = 1.0 * cumsum/sum(value, na.rm = TRUE)) %>%
  ungroup()

method_2 %>%
  arrange(id, year)
#> # Source:     SQL [?? x 5]
#> # Database:   DuckDB 0.7.1 [root@Darwin 22.4.0:R 4.2.2/:memory:]
#> # Ordered by: id, year
#>    id     year value cumsum cumprop
#>    <chr> <int> <int>  <dbl>   <dbl>
#>  1 a      1997     1      1  0.0833
#>  2 a      1998     4      5  0.417 
#>  3 a      1999     4      9  0.75  
#>  4 a      2000     3     12  1     
#>  5 a      2001     0     12  1     
#>  6 b      1997     4      4  0.4   
#>  7 b      1998     0      4  0.4   
#>  8 b      1999     1      5  0.5   
#>  9 b      2000     1      6  0.6   
#> 10 b      2001     4     10  1     
#> # ℹ more rows

method_2 %>%
  show_query()
#> <SQL>
#> SELECT *, (1.0 * cumsum) / SUM("value") OVER (PARTITION BY id) AS cumprop
#> FROM dbplyr_001

Created on 2023-04-18 with reprex v2.0.2

iangow avatar Apr 18 '23 10:04 iangow

Though note the artefacts of the previous windowing remain in the query (see output from show_query() above. Even putting window_order() in their two does not change this.

iangow avatar Apr 18 '23 11:04 iangow

Now you've lost me... Can you create a reprex and clearly state what's your issue? Otherwise I can't really help you.

mgirlich avatar Apr 18 '23 11:04 mgirlich

Now you've lost me... Can you create a reprex and clearly state what's your issue? Otherwise I can't really help you.

I have restored a version of the reprex with the initial issue at the top. There you can see that artefacts of windows persist, leading to weird query results that seem difficult to predict. Using compute() seems the only way to remove these artefacts. As you can see, results of method_1 are very different from those of method_2 even though latter just adds a compute().

I get correct results after adding window_frame() call (reprex in comment just above), but show_query() suggests that elements of windows are bleeding through into later query with no way to remove them (short of compute()).

iangow avatar Apr 18 '23 11:04 iangow

Your reprex is very complicated. Please try to boil it down next time. I think the essence is caught by:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

mf <- memdb_frame(g = 1, x = 1, y = 1, z = 1) %>% 
  group_by(g) %>% 
  window_order(x) %>% 
  window_frame(-Inf, 0)
mf2 <- mf %>% compute()

mf %>% 
  mutate(cumsum = sum(z, na.rm = TRUE)) %>% 
  show_query()
#> <SQL>
#> SELECT
#>   *,
#>   SUM(`z`) OVER (PARTITION BY `g` ORDER BY `x` ROWS UNBOUNDED PRECEDING) AS `cumsum`
#> FROM `dbplyr_001`

mf2 %>% 
  mutate(cumsum = sum(z, na.rm = TRUE)) %>% 
  show_query()
#> <SQL>
#> SELECT *, SUM(`z`) OVER (PARTITION BY `g`) AS `cumsum`
#> FROM `dbplyr_002`

Created on 2023-04-18 with reprex v2.0.2

mgirlich avatar Apr 18 '23 11:04 mgirlich

I wonder if we should add a .frame argument to mutate() so that it was easier to apply these effects locally.

hadley avatar Apr 18 '23 14:04 hadley

I wonder if we should add a .frame argument to mutate() so that it was easier to apply these effects locally.

I wonder if a more global approach wouldn't be to tweak the meaning of ungroup() to remove all aggregation windows. Also possibly add .groups = "drop" option to mutate(). In a way, because we do group_by() |> summarize() (not summarize(., .group_by)), it seems consistent to allow for group_by() |> window_order() |> window_frame() %>% mutate() (or even just add a .frame to window_order()).

It seems that even with the "local effect" solution, code should still behave predictably if users have supplied a remote data frame with attributes created by window_order() and window_frame() to mutate().

I think that dplyr brings closer together two things that are further apart in SQL. (Please forgive my ad hoc taxonomy here.)

  1. Aggregation queries
  • group_by() |> summarize() in dplyr
  • GROUP BY + aggregate functions in SQL
  • Aggregate functions operate over a single window per group and return a single value for each group
  1. Window queries
  • group_by() |> arrange() |> summarize() in dplyr
  • group_by() |> window_order() |> window_frame() |> mutate() in dbplyr
  • Aggregate functions OVER (PARTITION BY x ORDER BY y ROWS BETWEEN a AND b in SQL
  • Aggregate functions operate over a different window per row and return a single value for each row

With aggregation queries in dbplyr, I can use .groups = "drop" or ungroup() to remove the windows. But there is no clear equivalent function with window queries. I think it does not make sense to remove one element of the windows (e.g., the group_by() element, but not the window_order() and window_frame() elements), so perhaps ungroup() should remove all of them.

At times I find myself converting a query from summarise() to mutate(), but then need to remove .groups = "drop" (I think I always use .groups = "drop" with summarise()), so I think .groups = "drop" would be better than .frames = "drop" or similar.

A benefit of there being no dplyr implementations of window_order() and window_frame is that a slightly more dogmatic approach might break fewer things (I wonder how many queries in practice rely on windows persisting over multiple mutate() calls).

iangow avatar Apr 18 '23 20:04 iangow

Your reprex is very complicated. Please try to boil it down next time. I think the essence is caught by:

Yes. I was confused and struggled to "work forward" from simple code until I encountered the issue, so ended up "working backward" from complex code with the issue and may have stopped too early. (BTW, you can omit , y = 1 in your reprex.)

In your reprex, I think the fact that the windows don't survive compute() is understandable. And the fact that they do persist in mf also makes sense. What's missing in my mind is a non-compute() way to make them disappear when we want them to go away.

iangow avatar Apr 18 '23 21:04 iangow

@iangow I was thinking that the new arguments would work like .by:

mf <- memdb_frame(g = 1, x = 1, y = 1, z = 1)

mf |> 
  mutate(
    cumsum = sum(z, na.rm = TRUE),
    .by = g,
    .order = x,
    .frame = c(-Inf, 0)
  )

But probably also worth either making ungroup() remove the window attributes or add a new unwindow() function.

hadley avatar Apr 18 '23 21:04 hadley

@iangow I was thinking that the new arguments would work like .by:

mf <- memdb_frame(g = 1, x = 1, y = 1, z = 1)

mf |> 
  mutate(
    cumsum = sum(z, na.rm = TRUE),
    .by = g,
    .order = x,
    .frame = c(-Inf, 0)
  )

But probably also worth either making ungroup() remove the window attributes or add a new unwindow() function.

Déjà vu? See here.

Having both would be nice, I guess. (Recently I came across a window function that embeds the .frame equivalent inside the function. I can look it up if you'd like, but I certainly don't think that's necessary.)

I have not used .by, so don't know whether it is expected to handle users doing this:

mf |> 
   group_by(g) |>
   mutate(
     cumsum = sum(z, na.rm = TRUE),
     .order = x,
     .frame = c(-Inf, 0)
   )

or

mf |> 
   group_by(g) |>
   window_order(x) |>
   mutate(
     cumsum = sum(z, na.rm = TRUE),
     .frame = c(-Inf, 0)
   )

iangow avatar Apr 18 '23 21:04 iangow

Superseded by #1542

hadley avatar Dec 05 '25 21:12 hadley