Speeding up visdat

Open njtierney opened this issue 6 years ago • 2 comments

After some discussion with Mike, here are some ways to speed up visdat:

  • Revisit fingerprint: change it so that it doesn't call paste on every element (minor speedup)
  • Draw visdat as a series of rectangles, with segment lines drawn over the top to show the missing values. This would require two datasets: one for the coordinates of the rectangles, and one for the positions of the NA rows.
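
As a rough illustration of the first idea, here is a minimal base R sketch. It assumes fingerprint() only needs to map each element to the column's class label, or to NA when the element is missing. The name fingerprint_once is hypothetical and not part of visdat; the point is that the class label is built once per vector rather than once per element.

```r
# Hypothetical helper, not part of visdat: build the class label once,
# repeat it, then blank out the positions that are missing.
fingerprint_once <- function(x) {
  # collapse multi-class objects (e.g. POSIXct) into a single label
  label <- paste(class(x), collapse = "\n")
  out <- rep(label, length(x))
  out[is.na(x)] <- NA_character_
  out
}

fingerprint_once(c(1.5, NA, 3))
```

Whether this is actually faster than the existing implementation would need to be benchmarked; it is only meant to show the shape of the change.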

njtierney, Dec 04 '17 06:12

Could possibly use rle to create the encodings / start-end points for each rectangle.

rle(airquality$Ozone)
#> Run Length Encoding
#>   lengths: int [1:152] 1 1 1 1 1 1 1 1 1 1 ...
#>   values : int [1:152] 41 36 12 18 NA 28 23 19 8 NA ...

Created on 2019-06-08 by the reprex package (v0.2.1)
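
One caveat with the output above: rle() on the raw values treats every NA as its own run, so nearly every element becomes a separate run. A possible variation, sketched below under the assumption that only the missing / not-missing pattern matters for the rectangles, is to run-length encode is.na(x) instead. The helper name na_runs and its column names are hypothetical.

```r
# Hypothetical helper: run-length encode the missingness pattern of a vector
# and return one row per run, with the first and last row position of each run.
na_runs <- function(x) {
  runs <- rle(is.na(x))
  end <- cumsum(runs$lengths)
  start <- end - runs$lengths + 1L
  data.frame(
    start = start,
    end = end,
    is_missing = runs$values
  )
}

# three runs: rows 1-2 present, rows 3-4 missing, row 5 present
na_runs(c(1, 2, NA, NA, 5))
```

Each row of the result could then supply the start and end coordinates of one rectangle for that column.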

njtierney, Jun 08 '19 04:06

It looks like I might be able to use an alternative implementation of fingerprint that is a bit faster for larger vectors.

fingerprint <- function(x){

  x_class <- class(x)
  # is the data missing?
  ifelse(is.na(x),
         # yes? leave it as NA
         yes = NA,
         # no? set the value to the class of this cell
         no = glue::glue_collapse(x_class,
                                  sep = "\n")
  )
} # end function

fingerprint_2 <- function(x){
  x_class <- class(x)
  # is the data missing?
  dplyr::if_else(condition = is.na(x),
                 # yes? leave it as NA
                 true = NA_character_,
                 # no? set the value to the class of this cell
                 false = as.character(glue::glue_collapse(x_class,
                                                          sep = "\n"))
  )
} # end function

create_vec <- function(size){
  vec <- runif(size)
  vec[sample(vctrs::vec_seq_along(vec), size = round(size/10))] <- NA
  vec
}

fingerprint(create_vec(100))
#>   [1] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>   [8] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [15] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [22] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [29] NA        "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [36] "numeric" "numeric" "numeric" NA        NA        "numeric" "numeric"
#>  [43] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [50] "numeric" "numeric" NA        "numeric" "numeric" "numeric" "numeric"
#>  [57] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [64] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [71] NA        NA        "numeric" "numeric" NA        "numeric" "numeric"
#>  [78] "numeric" "numeric" "numeric" "numeric" NA        "numeric" NA       
#>  [85] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [92] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [99] "numeric" NA
fingerprint_2(create_vec(100))
#>   [1] NA        "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>   [8] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [15] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [22] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" NA       
#>  [29] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [36] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" NA       
#>  [43] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [50] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [57] "numeric" NA        "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [64] "numeric" "numeric" "numeric" "numeric" NA        "numeric" "numeric"
#>  [71] NA        NA        NA        "numeric" "numeric" "numeric" NA       
#>  [78] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [85] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" NA       
#>  [92] "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
#>  [99] "numeric" "numeric"

bm1 <- bench::press(
  size = c(1e2, 1e3, 1e4, 1e5, 1e6),
  {
    vec <- create_vec(size)
    bench::mark(
      new = fingerprint_2(vec),
      old = fingerprint(vec)
    )
  }
)
#> Running with:
#>      size
#> 1     100
#> 2    1000
#> 3   10000
#> 4  100000
#> 5 1000000
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

plot(bm1)
#> Loading required namespace: tidyr

summary(bm1)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 10 x 7
#>    expression    size      min   median `itr/sec` mem_alloc `gc/sec`
#>    <bch:expr>   <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#>  1 new            100  53.27µs   62.7µs  13557.     56.03KB    12.0 
#>  2 old            100  45.88µs  50.67µs  17296.      18.5KB     7.90
#>  3 new           1000  99.45µs 133.53µs   6725.     63.19KB     7.97
#>  4 old           1000 157.55µs 186.07µs   5136.     50.97KB     4.00
#>  5 new          10000 769.07µs 917.66µs    899.    625.69KB     9.99
#>  6 old          10000   1.68ms   1.97ms    462.    504.48KB     3.98
#>  7 new         100000   5.49ms   6.57ms    136.       6.1MB    16.0 
#>  8 old         100000  15.56ms  18.01ms     51.2     4.92MB     5.91
#>  9 new        1000000  61.29ms  71.12ms     11.3    61.04MB    28.3 
#> 10 old        1000000 151.73ms 155.05ms      6.44   49.21MB     4.83

Created on 2021-05-28 by the reprex package (v2.0.0)
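
To make the comparison easier to read, the median timings from the table above can be turned into old / new ratios. The sketch below copies the medians by hand (converted to microseconds), so the data frame and its column names are illustrative rather than something bench produces.

```r
# Median timings copied from the summary(bm1) output above, in microseconds.
medians <- data.frame(
  size = c(1e2, 1e3, 1e4, 1e5, 1e6),
  new_us = c(62.7, 133.53, 917.66, 6570, 71120),
  old_us = c(50.67, 186.07, 1970, 18010, 155050)
)

# ratio > 1 means the new implementation is faster at that size
medians$old_over_new <- round(medians$old_us / medians$new_us, 2)
medians
```

On these numbers the new version is slightly slower at size 100 (ratio about 0.8) and roughly 1.4 to 2.7 times faster from 1,000 elements upward, while allocating somewhat more memory at every size.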

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 4.0.5 (2021-03-31)
#>  os       macOS Big Sur 10.16         
#>  system   x86_64, darwin17.0          
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  ctype    en_AU.UTF-8                 
#>  tz       Australia/Brisbane          
#>  date     2021-05-28                  
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date       lib source            
#>  assertthat    0.2.1   2019-03-21 [1] standard (@0.2.1) 
#>  backports     1.2.1   2020-12-09 [1] standard (@1.2.1) 
#>  beeswarm      0.3.1   2021-03-07 [1] CRAN (R 4.0.2)    
#>  bench         1.1.1   2020-01-13 [1] CRAN (R 4.0.2)    
#>  cli           2.5.0   2021-04-26 [1] CRAN (R 4.0.2)    
#>  colorspace    2.0-0   2020-11-11 [1] standard (@2.0-0) 
#>  crayon        1.4.1   2021-02-08 [1] CRAN (R 4.0.2)    
#>  curl          4.3     2019-12-02 [1] standard (@4.3)   
#>  DBI           1.1.1   2021-01-15 [1] CRAN (R 4.0.2)    
#>  digest        0.6.27  2020-10-24 [1] standard (@0.6.27)
#>  dplyr         1.0.6   2021-05-05 [1] CRAN (R 4.0.2)    
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.0.2)    
#>  evaluate      0.14    2019-05-28 [1] standard (@0.14)  
#>  fansi         0.4.2   2021-01-15 [1] CRAN (R 4.0.2)    
#>  farver        2.1.0   2021-02-28 [1] CRAN (R 4.0.2)    
#>  fs            1.5.0   2020-07-31 [1] standard (@1.5.0) 
#>  generics      0.1.0   2020-10-31 [1] standard (@0.1.0) 
#>  ggbeeswarm    0.6.0   2017-08-07 [1] CRAN (R 4.0.2)    
#>  ggplot2       3.3.3   2020-12-30 [1] CRAN (R 4.0.2)    
#>  glue          1.4.2   2020-08-27 [1] standard (@1.4.2) 
#>  gtable        0.3.0   2019-03-25 [1] standard (@0.3.0) 
#>  highr         0.8     2019-03-20 [1] standard (@0.8)   
#>  htmltools     0.5.1.1 2021-01-22 [1] CRAN (R 4.0.2)    
#>  httr          1.4.2   2020-07-20 [1] standard (@1.4.2) 
#>  knitr         1.33    2021-04-24 [1] CRAN (R 4.0.2)    
#>  lifecycle     1.0.0   2021-02-15 [1] CRAN (R 4.0.2)    
#>  magrittr      2.0.1   2020-11-17 [1] standard (@2.0.1) 
#>  mime          0.10    2021-02-13 [1] CRAN (R 4.0.2)    
#>  munsell       0.5.0   2018-06-12 [1] standard (@0.5.0) 
#>  pillar        1.6.1   2021-05-16 [1] CRAN (R 4.0.2)    
#>  pkgconfig     2.0.3   2019-09-22 [1] standard (@2.0.3) 
#>  profmem       0.6.0   2020-12-13 [1] CRAN (R 4.0.2)    
#>  purrr         0.3.4   2020-04-17 [1] standard (@0.3.4) 
#>  R6            2.5.0   2020-10-28 [1] standard (@2.5.0) 
#>  reprex        2.0.0   2021-04-02 [1] CRAN (R 4.0.2)    
#>  rlang         0.4.11  2021-04-30 [1] CRAN (R 4.0.2)    
#>  rmarkdown     2.8     2021-05-07 [1] CRAN (R 4.0.2)    
#>  rstudioapi    0.13    2020-11-12 [1] standard (@0.13)  
#>  scales        1.1.1   2020-05-11 [1] standard (@1.1.1) 
#>  sessioninfo   1.1.1   2018-11-05 [1] standard (@1.1.1) 
#>  stringi       1.5.3   2020-09-09 [1] standard (@1.5.3) 
#>  stringr       1.4.0   2019-02-10 [1] standard (@1.4.0) 
#>  styler        1.4.1   2021-03-30 [1] CRAN (R 4.0.2)    
#>  tibble        3.1.2   2021-05-16 [1] CRAN (R 4.0.2)    
#>  tidyr         1.1.3   2021-03-03 [1] CRAN (R 4.0.2)    
#>  tidyselect    1.1.0   2020-05-11 [1] standard (@1.1.0) 
#>  utf8          1.2.1   2021-03-12 [1] CRAN (R 4.0.2)    
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.0.2)    
#>  vipor         0.4.5   2017-03-22 [1] CRAN (R 4.0.2)    
#>  withr         2.4.2   2021-04-18 [1] CRAN (R 4.0.3)    
#>  xfun          0.23    2021-05-15 [1] CRAN (R 4.0.2)    
#>  xml2          1.3.2   2020-04-23 [1] standard (@1.3.2) 
#>  yaml          2.2.1   2020-02-01 [1] standard (@2.2.1) 
#> 
#> [1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library

njtierney, May 28 '21 06:05