visdat icon indicating copy to clipboard operation
visdat copied to clipboard

Using visdat with a tbl_dbi database object

Open ablack3 opened this issue 6 years ago • 3 comments

Is it possible to use visdat with a database table (referenced by a tbl_dbi object in R)?

ablack3 avatar Apr 05 '18 22:04 ablack3

Here is an idea. Can I supply visdat with a table containing the missing value pattern in the database table and get a nice plot?

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(wakefield))

df <- r_data_frame(n= 100, id, age, height) %>% 
     r_na(prob = .4)

con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")


tbl_dbi <- copy_to(con, df, "df")
tbl_dbi
#> # Source:   table<df> [?? x 3]
#> # Database: sqlite 3.19.3 []
#>    ID      Age Height
#>    <chr> <int>  <dbl>
#>  1 001      NA   71.0
#>  2 002      21   68.0
#>  3 003      NA   62.0
#>  4 004      34   NA  
#>  5 005      NA   NA  
#>  6 006      20   NA  
#>  7 007      31   NA  
#>  8 008      24   72.0
#>  9 009      20   74.0
#> 10 010      NA   70.0
#> # ... with more rows

# function to pull info from database to feed into visdat
count_missing <- function(tbl_dbi, ...){
     dots <- rlang::enquos(...)
     tbl_dbi %>% 
          ungroup() %>% 
          mutate_at(vars(!!!dots), funs(case_when(
               . == "" ~ "blank",
               is.na(.) ~ "NA",
               TRUE ~ "not blank/NA"
          ))) %>% 
          count(!!!dots) 
}

# pull info from database needed to construct visdat plot
count_missing(tbl_dbi, Age, Height)
#> # Source:   lazy query [?? x 3]
#> # Database: sqlite 3.19.3 []
#> # Groups:   Age
#>   Age          Height           n
#>   <chr>        <chr>        <int>
#> 1 NA           NA              16
#> 2 NA           not blank/NA    24
#> 3 not blank/NA NA              24
#> 4 not blank/NA not blank/NA    36
Created on 2018-04-06 by the reprex package (v0.2.0).

ablack3 avatar Apr 06 '18 18:04 ablack3

Thanks for filing an issue, and for the nice illustrative example!

I would like to have visdat work with objects "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl" - the way that I think would be best to address this would be to create separate functions for each of the vis_ family to operate on tbl_dbi objects.

So currently there is vis_dat() which checks if you have a dataframe, but then there would be vis_dat.tbl_dbi, which would "just work" with tbl_dbi - so you'd type vis_dat(tbl_dbi) and you'd get a plot that you would expect.

This might be made easier by some of the ideas in retrieving the data, described in #78.

At the moment I don't think that this will make it into the next release of visdat (submitting today or tomorrow to CRAN), but I would be really interested to discuss this in more detail - would you be interested in discussing this further?

njtierney avatar Jun 04 '18 04:06 njtierney

Sure! I think that it would be handy to easily get a missing data plot from a database table. The application I'm thinking of would be for use with large datasets that are too big to get into R. Your idea of making the vis_dat function generic and creating a tbl_dbi method seems like a good one to me. For column oriented databases creating a viz based on a few columns should be very fast even for a huge number of rows.

Would you be able to create a lower level internal function that would accept something like the table produced by the count_missing function I defined above and produce a viz?

ablack3 avatar Jun 07 '18 11:06 ablack3