Using arrow results in error "not really a table object"

Open DavZim opened this issue 1 year ago • 2 comments

Prework

  • [x] Read and agree to the code of conduct and contributing guidelines.
  • [x] If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • [x] Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • [x] Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • [x] Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • [x] Readable: format your code according to the tidyverse style guide.

Description

When I interrogate an agent whose table is an arrow object, I get the following error: "The 'table' in this validation step is not really a table object."

When I convert the arrow table to a data.frame first, pointblank works as expected:

create_agent(as.data.frame(df)) |> # NOTE the as.data.frame here
  col_is_numeric(vars(x)) |> 
  interrogate()

#> ── Interrogation Started - there is a single validation step ──────────────────────────────────────────────── 
#> ✔ Step 1: OK.
#> ── Interrogation Completed ──────────────────────────────────────────────────────────────────────────────────
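
The same workaround can be wrapped in a small helper so that it also covers arrow Datasets and arrow dplyr queries. This is only a sketch: `materialize_tbl()` is a hypothetical name, not part of {pointblank} or {arrow}, and it assumes that arrow's R6 objects inherit from the class "ArrowObject". Note that collecting pulls the whole table into memory, which may not be feasible for large datasets.

```r
# Hypothetical helper (not part of {pointblank} or {arrow}).
# Assumption: arrow Tables and Datasets inherit from "ArrowObject",
# and lazy arrow queries have the class "arrow_dplyr_query".
materialize_tbl <- function(tbl) {
  is_arrow <- inherits(tbl, c("ArrowObject", "arrow_dplyr_query"))
  if (is_arrow) {
    # Pull the data into an in-memory data frame / tibble
    return(dplyr::collect(tbl))
  }
  # Anything else is passed through unchanged
  tbl
}

# Usage (requires {arrow}, {dplyr}, and {pointblank}):
# create_agent(materialize_tbl(df)) |>
#   col_is_numeric(vars(x)) |>
#   interrogate()
```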

Reproducible example

library(pointblank)
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp


df <- arrow_table(x = 1:3, y = c("a", "b", "c"))

agent <- create_agent(df) |> 
  col_is_numeric(vars(x))
agent |> get_agent_report(display_table = FALSE)
#> # A tibble: 1 × 14
#>       i type  columns values precon active eval  units n_pass f_pass W     S    
#>   <int> <chr> <chr>   <chr>  <chr>  <lgl>  <chr> <int>  <int>  <dbl> <lgl> <lgl>
#> 1     1 col_… x       <NA>   <NA>   NA     <NA>     NA     NA     NA NA    NA   
#> # … with 2 more variables: N <lgl>, extract <lgl>

agent |> interrogate() |> get_agent_report(display_table = FALSE)
#> # A tibble: 1 × 14
#>       i type  columns values precon active eval  units n_pass f_pass W     S    
#>   <int> <chr> <chr>   <chr>  <chr>  <lgl>  <chr> <int>  <int>  <dbl> <lgl> <lgl>
#> 1     1 col_… x       <NA>   <NA>   TRUE   ERROR    NA     NA     NA NA    NA   
#> # … with 2 more variables: N <lgl>, extract <int>

# repeat with an arrow Dataset opened from disk ---------
write_dataset(df, "arrow-dataset")
ds <- open_dataset("arrow-dataset")

agent <- create_agent(ds) |> 
  col_is_numeric(vars(x))
agent |> get_agent_report(display_table = FALSE)
#> # A tibble: 1 × 14
#>       i type  columns values precon active eval  units n_pass f_pass W     S    
#>   <int> <chr> <chr>   <chr>  <chr>  <lgl>  <chr> <int>  <int>  <dbl> <lgl> <lgl>
#> 1     1 col_… x       <NA>   <NA>   NA     <NA>     NA     NA     NA NA    NA   
#> # … with 2 more variables: N <lgl>, extract <lgl>

agent |> interrogate() |> get_agent_report(display_table = FALSE)
#> # A tibble: 1 × 14
#>       i type  columns values precon active eval  units n_pass f_pass W     S    
#>   <int> <chr> <chr>   <chr>  <chr>  <lgl>  <chr> <int>  <int>  <dbl> <lgl> <lgl>
#> 1     1 col_… x       <NA>   <NA>   TRUE   ERROR    NA     NA     NA NA    NA   
#> # … with 2 more variables: N <lgl>, extract <int>

Created on 2023-04-17 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23)
#>  os       Ubuntu 18.04.6 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Etc/UTC
#>  date     2023-04-17
#>  pandoc   2.18 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version  date (UTC) lib source
#>  arrow       * 11.0.0.3 2023-03-08 [1] RSPM (R 4.2.1)
#>  assertthat    0.2.1    2019-03-21 [1] RSPM (R 4.2.1)
#>  bit           4.0.4    2020-08-04 [1] RSPM (R 4.2.1)
#>  bit64         4.0.5    2020-08-30 [1] RSPM (R 4.2.1)
#>  blastula      0.3.3    2023-01-07 [1] RSPM (R 4.2.1)
#>  cli           3.6.0    2023-01-09 [1] RSPM (R 4.2.1)
#>  digest        0.6.31   2022-12-11 [1] RSPM (R 4.2.1)
#>  dplyr         1.1.1    2023-03-22 [1] RSPM (R 4.2.1)
#>  evaluate      0.16     2022-08-09 [1] RSPM (R 4.2.1)
#>  fansi         1.0.3    2022-03-24 [1] RSPM (R 4.2.1)
#>  fastmap       1.1.1    2023-02-24 [1] RSPM (R 4.2.1)
#>  fs            1.6.1    2023-02-06 [1] RSPM (R 4.2.1)
#>  generics      0.1.3    2022-07-05 [1] RSPM (R 4.2.1)
#>  glue          1.6.2    2022-02-24 [1] RSPM (R 4.2.1)
#>  htmltools     0.5.4    2022-12-07 [1] RSPM (R 4.2.1)
#>  knitr         1.42     2023-01-25 [1] RSPM (R 4.2.1)
#>  lifecycle     1.0.3    2022-10-07 [1] RSPM (R 4.2.1)
#>  magrittr      2.0.3    2022-03-30 [1] RSPM (R 4.2.1)
#>  pillar        1.8.1    2022-08-19 [1] RSPM (R 4.2.1)
#>  pkgconfig     2.0.3    2019-09-22 [1] RSPM (R 4.2.1)
#>  pointblank  * 0.11.3   2023-02-09 [1] RSPM (R 4.2.1)
#>  purrr         1.0.1    2023-01-10 [1] RSPM (R 4.2.1)
#>  R6            2.5.1    2021-08-19 [1] RSPM (R 4.2.1)
#>  reprex        2.0.2    2022-08-17 [2] RSPM (R 4.2.1)
#>  rlang         1.1.0    2023-03-14 [1] RSPM (R 4.2.1)
#>  rmarkdown     2.16     2022-08-24 [1] RSPM (R 4.2.1)
#>  rstudioapi    0.14     2022-08-22 [2] RSPM (R 4.2.1)
#>  sessioninfo   1.2.2    2021-12-06 [1] RSPM (R 4.2.1)
#>  tibble        3.2.1    2023-03-20 [1] RSPM (R 4.2.1)
#>  tidyselect    1.2.0    2022-10-10 [1] RSPM (R 4.2.1)
#>  utf8          1.2.2    2021-07-24 [1] RSPM (R 4.2.1)
#>  vctrs         0.6.1    2023-03-22 [1] RSPM (R 4.2.1)
#>  withr         2.5.0    2022-03-03 [2] RSPM (R 4.2.1)
#>  xfun          0.38     2023-03-24 [1] RSPM (R 4.2.1)
#>  yaml          2.3.7    2023-01-23 [1] RSPM (R 4.2.1)
#> 
#>  [1] /home/NAME/R/x86_64-pc-linux-gnu-library/4.2
#>  [2] /usr/r-library/admin-library/4.2
#>  [3] /opt/R/4.2.1/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

DavZim • Apr 17 '23 12:04

Thanks for reporting this and providing a lot of details! This is definitely not right and requires a fix.

rich-iannone • Jul 21 '23 18:07

FWIW I see this more as a feature request than a bug. Think of arrow as another backend. I think the error message is informative (though not perfect). An arrow dataset is neither a data.frame nor a database table, so I would expect the current approach not to work. I couldn't find a claim in the {pointblank} documentation that the tbl argument of create_agent() can be an arrow::Table.

As a first suggestion, it would be great to have the documentation of the supported backends in a more prominent location (e.g. a paragraph on the {pkgdown} site).

A second suggestion: maybe, as a first step, error with a clear message that arrow tables (or datasets, etc.) are not (yet) supported, and open a follow-up issue to implement support for arrow inputs? (I have done some work on {arrow} in the past and I think this might not be a trivial endeavour.)
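
To make the second suggestion concrete, here is a rough sketch of what such an early check might look like. `stop_if_arrow_input()` is a hypothetical name and is not how {pointblank} currently validates its input; the class names are assumptions about what arrow objects inherit from ("ArrowObject" for Tables and Datasets, "arrow_dplyr_query" for lazy queries).

```r
# Hypothetical guard; not part of the {pointblank} API.
# Assumption: arrow Tables/Datasets inherit from "ArrowObject" and
# lazy arrow queries carry the class "arrow_dplyr_query".
stop_if_arrow_input <- function(tbl) {
  if (inherits(tbl, c("ArrowObject", "arrow_dplyr_query"))) {
    stop(
      "Arrow inputs (class `", class(tbl)[1], "`) are not (yet) supported. ",
      "Convert the table first, e.g. with `as.data.frame()`.",
      call. = FALSE
    )
  }
  # Return the input invisibly so the check can sit in a pipeline
  invisible(tbl)
}
```

Called at the top of a function like create_agent(), a check of this kind would fail at agent creation rather than at interrogation time, which is where the error currently surfaces in the report above.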

(By the way, thanks a lot for the great package and for the R in Pharma workshop.)

dragosmg • Oct 17 '23 18:10