dbplyr Bug in 2.2.0 (carried forward to 2.2.1) - code that worked in 2.1.1. no longer works

Not sure if I can be more specific than the title, but at least I narrowed it down to the specific version, 2.2.0, where the my code starts to break. And the bug is carried forward to 2.2.1.

I have a fairly hefty dbplyr lazy query (not sure if worth sharing, but can do on request). It runs fine and creates a tbl_Oracle object. The lazy evaluation is also fine, outputting the top 10 rows.

It is when I try to collect() the query into a local data.frame that it gives me an error:

Error: nanodbc/nanodbc.cpp:1655: HY000: [Oracle][ODBC][Ora]ORA-24374: define not done before fetch or execute and fetch

A colleague has a different behaviour - instead of the error, their R Studio crashes entirely.

If I extract the raw SQL using show_query() and run it in another client, such as SQL Developer, it runs fine, which shows this isn't a translation issue.

Aug 12 '22 13:08 mgarbuzov

Can you try to make a reprex for you issue? See Reprexes for dbplyr for information on how to create one.

Aug 16 '22 08:08 mgirlich

@mgirlich please find a reprex below. Getting a slightly different error with an in-memory db, parser stack overflow, slightly different from the Oracle-specific error in the real db. Apologies, I couldn't make the reprex shorter, but I think this is because the issue is the query is too long. The breakdowns list used in the loop contains 11 breakdowns, if a few of them are removed, the error disappears, so my guess is that it is indeed to do with the sheer length of the query. However, the SQL query, as seen using show_query(), is still valid and runs in a native SQL client, as well as in the older dbplyr 2.1.1.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

local_tbl <- structure(list(YEAR_MONTH = c("202001", "202002", "202003", "202004", 
                                           "202005"), PERSON_ID = c(490416, 490416, 490416, 490416, 490416
                                           ), FINANCIAL_YEAR = c("2019/20", "2019/20", "2019/20", "2020/21", 
                                                                 "2020/21"), PERSON_OPTOUT_DECISION = c("No", "No", "No", "No", 
                                                                                                        "No"), VPD_REGION = c("Wales", "Wales", "Wales", "Wales", "Wales"
                                                                                                        ), VPD_TRUST_TYPE_GROUP = c("ACUTE", "ACUTE", "ACUTE", "ACUTE", 
                                                                                                                                    "ACUTE"), VPD_NAME = c("Cardiff and Vale University LHB", "Cardiff and Vale University LHB", 
                                                                                                                                                           "Cardiff and Vale University LHB", "Cardiff and Vale University LHB", 
                                                                                                                                                           "Cardiff and Vale University LHB"), AGE_BAND = c("41-45", "41-45", 
                                                                                                                                                                                                            "41-45", "41-45", "41-45"), GENDER = c("M", "M", "M", "M", "M"
                                                                                                                                                                                                            ), PERSON_ETHNIC_5 = c("White", "White", "White", "White", "White"
                                                                                                                                                                                                            ), UK_NATIONAL_FLAG = c("Y", "Y", "Y", "Y", "Y"), CONTRACT_TYPE = c("Permanent", 
                                                                                                                                                                                                                                                                                "Permanent", "Permanent", "Permanent", "Permanent"), STAFF_GROUP = c("Administrative and Clerical", 
                                                                                                                                                                                                                                                                                                                                                     "Administrative and Clerical", "Administrative and Clerical", 
                                                                                                                                                                                                                                                                                                                                                     "Administrative and Clerical", "Administrative and Clerical"), 
                            AFC_BAND = c("Band 6", "Band 6", "Band 6", "Band 6", "Band 6"
                            ), CONTINENT = c("Europe", "Europe", "Europe", "Europe", 
                                             "Europe"), PENSION_CONTRIBUTION_BAND = c("2_Medium", "2_Medium", 
                                                                                      "2_Medium", "2_Medium", "2_Medium"), DISABILITY_HIGH_LEVEL_CATEGORY = c("No", 
                                                                                                                                                              "No", "No", "No", "No"), FTE_CATEGORY = c("5_PT_0.75<=1.00", 
                                                                                                                                                                                                        "5_PT_0.75<=1.00", "5_PT_0.75<=1.00", "5_PT_0.75<=1.00", 
                                                                                                                                                                                                        "5_PT_0.75<=1.00"), PERSON_PRIMARY_FLAG = c(1, 1, 1, 1, 1
                                                                                                                                                                                                        )), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
                                                                                                                                                                                                                                                                    -5L))

base_db <- local_tbl |> tbl_memdb()

base_db <- base_db |>
  filter(PERSON_PRIMARY_FLAG == 1) |> # keep one person month only
  filter(YEAR_MONTH >= 202004)

latest_month <- base_db |>
  summarise(LATEST_YEAR_MONTH = max(YEAR_MONTH, na.rm = TRUE)) |>
  ungroup() |>
  pull()

analysis_period <- c("2020/21", "2021/22", latest_month)

base_year_db <- base_db |>
  group_by(FINANCIAL_YEAR, PERSON_ID) |>
  summarise(YEAR_MONTH = max(YEAR_MONTH, na.rm = TRUE)) |>
  ungroup() |>
  inner_join(
    base_db |>
      select(
        YEAR_MONTH, PERSON_ID, FINANCIAL_YEAR, PERSON_OPTOUT_DECISION, VPD_REGION,
        VPD_TRUST_TYPE_GROUP, VPD_NAME, AGE_BAND, GENDER, PERSON_ETHNIC_5,
        UK_NATIONAL_FLAG, CONTRACT_TYPE, STAFF_GROUP, AFC_BAND, CONTINENT,
        PENSION_CONTRIBUTION_BAND, DISABILITY_HIGH_LEVEL_CATEGORY, FTE_CATEGORY
      ),
    by = c("YEAR_MONTH", "PERSON_ID", "FINANCIAL_YEAR")
  )


breakdowns <- list(
  "VPD" = c(BREAKDOWN_NAME = c("VPD_TRUST_TYPE_GROUP", "VPD_REGION", "VPD_NAME")),
  "Age band and gender" = c(BREAKDOWN_NAME = c("GENDER", "AGE_BAND")),
  "Ethnicity (5 groups)" = c(BREAKDOWN_NAME = "PERSON_ETHNIC_5"),
  "Nationality (UK/Non-UK/Not asssigned)" = c(BREAKDOWN_NAME = "UK_NATIONAL_FLAG"),
  "Contract type" = c(BREAKDOWN_NAME = "CONTRACT_TYPE"),
  "Staff group" = c(BREAKDOWN_NAME = "STAFF_GROUP"),
  "AFC band" = c(BREAKDOWN_NAME = "AFC_BAND"),
  "Continent" = c(BREAKDOWN_NAME = "CONTINENT"),
  "FTE category" = c(BREAKDOWN_NAME = "FTE_CATEGORY"),
  "Pension contribution band" = c(BREAKDOWN_NAME = "PENSION_CONTRIBUTION_BAND"),
  "Disability flag" = c(BREAKDOWN_NAME = "DISABILITY_HIGH_LEVEL_CATEGORY")
)


for (breakdown_name in names(breakdowns)) {
  for (period in analysis_period) {

    breakdown_cols <- breakdowns[[breakdown_name]]

    tmp_db <- base_year_db |>
      mutate(PERIOD = case_when(
        period %in% c("2020/21", "2021/22") ~ FINANCIAL_YEAR,
        period == latest_month ~ YEAR_MONTH
      )) |>
      filter(PERIOD == period) |>
      group_by(
        PERIOD,
        across(all_of(unname(breakdown_cols)))
      ) |>
      summarise(
        OPTOUT_COUNT = sum(ifelse(PERSON_OPTOUT_DECISION == "Yes", 1, 0), na.rm = TRUE),
        TOTAL = n()
      ) |>
      ungroup() |>
      mutate(
        OPTOUT_RATE = OPTOUT_COUNT / TOTAL * 100,
        METRIC = breakdown_name
      )
    

    if (breakdown_name == "VPD" & period == "2020/21") {
      year_breakdown_db <- tmp_db
    } else {
      year_breakdown_db <- union_all(
        x = year_breakdown_db,
        y = tmp_db
      )
    }
  }
}

metrics_df <- year_breakdown_db |> collect()
#> `summarise()` has grouped output by "PERIOD", "VPD_TRUST_TYPE_GROUP", and
#> "VPD_REGION". You can override using the `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD", "VPD_TRUST_TYPE_GROUP", and
#> "VPD_REGION". You can override using the `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD", "VPD_TRUST_TYPE_GROUP", and
#> "VPD_REGION". You can override using the `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD" and "GENDER". You can override
#> using the `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD" and "GENDER". You can override
#> using the `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD" and "GENDER". You can override
#> using the `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> `summarise()` has grouped output by "PERIOD". You can override using the
#> `.groups` argument.
#> `summarise()` has grouped output by "FINANCIAL_YEAR". You can override using
#> the `.groups` argument.
#> Error: parser stack overflow

^{Created on 2022-08-16 by the reprex package (v2.0.1)}

Aug 16 '22 10:08 mgarbuzov

You create a very big query indeed. I don't know why the query works in a native client but not in dbplyr. But I think you are better off collecting in each iteration, e.g. something like

f_summary <- function(breakdown_name, period) {
  breakdown_cols <- breakdowns[[breakdown_name]]
  
  tmp_db <- base_year_db |>
    mutate(PERIOD = case_when(
      period %in% c("2020/21", "2021/22") ~ FINANCIAL_YEAR,
      period == latest_month ~ YEAR_MONTH
    )) |>
    filter(PERIOD == period) |>
    group_by(
      PERIOD,
      across(all_of(unname(breakdown_cols)))
    ) |>
    summarise(
      OPTOUT_COUNT = sum(ifelse(PERSON_OPTOUT_DECISION == "Yes", 1, 0), na.rm = TRUE),
      TOTAL = n(),
      .groups = "drop"
    ) |>
    mutate(
      OPTOUT_RATE = OPTOUT_COUNT / TOTAL * 100,
      METRIC = breakdown_name
    ) %>% 
    collect()
  
  if (nrow(tmp_db) == 0) {
    return(NULL)
  }
  
  tmp_db
}

metrics_df <- tidyr::crossing(breakdown_name = names(breakdowns), period = analysis_period) %>% 
  purrr::pmap_dfr(f_summary)

In the end dbplyr is not made for huge queries. Also the database is usually not so happy about these huge queries.

Aug 16 '22 13:08 mgirlich

@mgirlich but it was working in dbplyr 2.1.1, I wonder if a small change was made somewhere that broke it and is relatively easy to restore to full functionality?

Aug 16 '22 15:08 mgarbuzov

Unfortunately, this ist not the case. There were some pretty big changes between 2.1.1 and 2.2.0. I couldn't reproduce your issue. At least for SQLite it sounds like the query has too many subqueries. The issue for Oracle might be the same or it might be different. That the query itself works if you copy paste it, is quite confusing. Sorry, but I really can't help here.

Aug 17 '22 04:08 mgirlich

dbplyr dbplyr copied to clipboard

Bug in 2.2.0 (carried forward to 2.2.1) - code that worked in 2.1.1. no longer works

dbplyr
dbplyr copied to clipboard