Improve performance (for big datasets)
First of all, thanks for making such a useful package!

However, I have been trying to `skim` a dataset with lots of columns (thousands) and noticed very poor performance. The full dataset, with >20,000 columns, ran overnight without completing, so the slowdown appears to be much worse than linear in the number of columns. I found similar issues being discussed in #370 and elsewhere.

I had a look to see where the bottleneck was: the `skim_by_type` methods and the `build_results` function they call are both very slow when there are lots of columns. ~~This looks to me like an issue with `dplyr::across` when running `summarize` with lots of column/function pairs. Notwithstanding that,~~ I refactored `skim_by_type` to improve performance by a factor of 25 for a 100,000 x 1,500 dataset. The larger dataset I am working with (which previously did not complete overnight) now runs in ~1 minute. I may be missing something here, so apologies in advance if so.

I am happy to make a branch to demonstrate this properly and open it up for improvement, but for now see below for a reproducible example showing the relative performance of the refactored `skim_by_type` function, which should be able to replace all three of the current methods:
```r
library(tidyverse)
library(skimr)
library(stringi)
library(microbenchmark)

#### Make some wide test data
number_rows <- 100000
coltypes <- c(rep("numeric", 500), rep("date", 500), rep("char", 500))

large_test_df <- bind_cols(
  map(seq_along(coltypes), function(col) {
    if (coltypes[col] == "numeric") content <- round(runif(number_rows, 0, col), sample(c(0, 1, 2, 3), 1))
    if (coltypes[col] == "date") content <- sample(seq(as.Date("1950-01-01"), Sys.Date(), by = "day"), number_rows, replace = TRUE)
    if (coltypes[col] == "char") content <- sample(stri_rand_strings(500, sample(c(1, 3, 5), 1)), number_rows, replace = TRUE)
    tibble(!!sym(str_c("col", col)) := content)
  })
)
```
```r
#### Save the original function and define the refactored function
skim_by_type_original <- skimr:::skim_by_type

skim_by_type_refactor <- function(mangled_skimmers, variable_names, data) {
  group_columns <- groups(data)
  data <- as_tibble(data)
  # Pivot one variable at a time to long format, then compute all skimmers
  # in a single summarize() call per variable.
  map_df(variable_names, function(skim_var) {
    select(data, !!skim_var, !!!group_columns) %>%
      pivot_longer(cols = all_of(skim_var), names_to = "skim_variable", values_to = "values") %>%
      group_by(skim_variable, !!!group_columns) %>%
      summarize(across(values, mangled_skimmers$funs), .groups = "drop")
  }) %>%
    # Drop the mangled "values_" prefix that across() adds to the result names
    rename_with(~ str_remove(.x, "values_~!@#\\$%\\^\\&\\*\\(\\)-\\+"), everything()) %>%
    # Restore the original variable order
    arrange(match(skim_variable, variable_names))
}
```
```r
#### Run the test
microbenchmark(
  {
    assignInNamespace("skim_by_type", skim_by_type_original, ns = "skimr")
    full_version <- skim(large_test_df)
  },
  {
    assignInNamespace("skim_by_type", skim_by_type_refactor, ns = "skimr")
    full_version <- skim(large_test_df)
  },
  times = 5
)
```
```
Unit: seconds
                                                                                                                  expr
 { assignInNamespace("skim_by_type", skim_by_type_original, ns = "skimr") full_version <- skimr::skim(large_test_df) }
 { assignInNamespace("skim_by_type", skim_by_type_refactor, ns = "skimr") full_version <- skimr::skim(large_test_df) }
      min        lq      mean    median        uq       max neval
 251.1866 256.76540 256.68652 256.94793 258.26105 260.27161     5
  10.1157  10.18126  10.30293  10.28105  10.35256  10.58411     5
```
Wow, yes, we have really been trying to think about the bottlenecks. We have known that the processing was very slow, so this is really helpful! @michaelquinn32 thoughts?

In the original code we have `skim_by_type.grouped_df()` and `skim_by_type.data.frame()` (among others). But I don't think your code is using groups? Are you suggesting a change to several functions, or that we drop the differentiation?
Hey @elinw, yes, I saw that in the original code `skim_by_type` is split into three (?) methods to handle `grouped_df`, `data.table`, and plain `data.frame`s. The single function above should handle each of these classes without issue, but of course it could be split into different methods to handle each class separately; it just seemed easier to demonstrate it as above. If you are interested, I'd be happy to formalise things in a new branch when I find the time, but I don't have permissions to do so at the moment.
Well, to create a new branch you would fork the project and create the branch in your own repo. We use the S3 approach because, among other things, it lets people develop skim methods for different kinds of objects. Now that you have raised this, I think we are really at the point in skimr's maturity where working on performance (and, relatedly, pushing into a database when appropriate) is the biggest thing we can do.
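For what it's worth, here is a hypothetical sketch of how the S3 split could be kept while sharing one implementation; `skim_by_type_impl` is an assumed helper name, not part of skimr's actual internals:

```r
# Hypothetical sketch: keep S3 dispatch but delegate to one shared worker.
skim_by_type_impl <- function(mangled_skimmers, variable_names, data) {
  # shared implementation would go here (e.g. the pivot-based version above)
}

skim_by_type <- function(mangled_skimmers, variable_names, data) {
  UseMethod("skim_by_type", data)  # dispatch on the data argument
}

skim_by_type.grouped_df <- function(mangled_skimmers, variable_names, data) {
  skim_by_type_impl(mangled_skimmers, variable_names, data)
}

skim_by_type.data.frame <- function(mangled_skimmers, variable_names, data) {
  skim_by_type_impl(mangled_skimmers, variable_names, data)
}
```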
That's awesome, Henry. Thanks!
We would love a pull request. I would be happy to review your code.
As @elinw pointed out, create a branch with your change and we can go from there.
- If you haven't yet, take a look at our contributor's guide
- Code should pass a run of `lintr` with the default settings
- And you should run `styler::style_pkg()` before we merge
I'm surprised that `dplyr::across()` performs so much worse than pivoting the data first, but I'm thrilled to have discovered this opportunity for improving the user experience.

Thanks!
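For anyone who wants to isolate that comparison, a minimal sketch (the sizes are arbitrary) might benchmark one wide `summarize(across(...))` call against the per-variable pivot approach:

```r
# Sketch: one wide summarize(across(...)) call vs. pivoting each variable
# to long format and summarizing per variable. Sizes are arbitrary.
library(dplyr)
library(tidyr)
library(purrr)
library(microbenchmark)

wide <- as_tibble(setNames(replicate(300, runif(1e4), simplify = FALSE),
                           paste0("col", 1:300)))

microbenchmark(
  across_once = summarize(wide, across(everything(), list(mean = mean, sd = sd))),
  pivot_per_variable = map_df(names(wide), function(v) {
    wide %>%
      select(all_of(v)) %>%
      pivot_longer(everything(), names_to = "skim_variable", values_to = "values") %>%
      group_by(skim_variable) %>%
      summarize(across(values, list(mean = mean, sd = sd)), .groups = "drop")
  }),
  times = 3
)
```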
Hey @michaelquinn32,

I had a chance to look at this properly rather than just rewriting the function that was going slow.

First off, I should clarify that it is not `dplyr::across` that is causing issues. I am afraid to even think what the punishment is for people who bring `dplyr` into disrepute.

Amazingly, it is the call to `as.data.frame` buried inside `skimr:::reshape_skimmed` that causes the majority of the slowness (see here). My explanation is that `as.data.frame` gets slower as the object being converted gets larger. Combined with the fact that `reshape_skimmed` is called once per variable, this leads to worse-than-linear performance degradation as you add columns.
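As a rough illustration of the coercion cost on its own, one could compare the two coercions directly on a wide tibble (the column count below is arbitrary; if the explanation above is right, the gap should grow with width):

```r
# Sketch: compare as.data.frame() and as_tibble() on a wide tibble.
library(tibble)
library(microbenchmark)

wide <- as_tibble(setNames(replicate(2000, runif(100), simplify = FALSE),
                           paste0("col", 1:2000)))

microbenchmark(
  base_data_frame = as.data.frame(wide),
  tibble_coercion = as_tibble(wide),
  times = 20
)
```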
Anyway, I have made an initial PR that simply changes `as.data.frame` to `as_tibble`. What I suggested above in the thread is faster and quite a lot cleaner (in my opinion), so I can make a subsequent PR to rewrite the methods; this initial change is just very low-hanging fruit.
Let's look at the timing for both grouped and ungrouped data with the `as_tibble` change. I'm wondering what we were thinking when we used `as.data.frame()`; was there some kind of edge case? It's right in the middle of a bunch of tidyverse code. This is the commit that essentially added the `as.data.frame()`: https://github.com/ropensci/skimr/commit/81b7ddd8011be8565371e0d4d84e5f2bc46ed15e

Should we be coercing to a tibble after we check that the input to `skim` is a data frame or coercible to a data frame?
Even when we merge that change, let's keep this issue open for the general topic of making skimr faster in different situations.

Also, let's consider performance without histograms as the baseline, since we know they have a big impact.
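For reference, individual skimmers can already be dropped with `skim_with()`; a minimal sketch that builds a `skim()` variant without the numeric histogram:

```r
# Sketch: a skim() variant that skips the inline histogram for numeric columns.
library(skimr)
skim_no_hist <- skim_with(numeric = sfl(hist = NULL))
skim_no_hist(iris)
```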
Hi, I love skimr and use it a lot. Are there any updates on the performance issue? My datasets have 50-300+ variables and 8-60 million observations, and skimr takes a long time to do summaries. Thank you!
Thanks for using skimr!
We do support summaries of data.table objects. Have you tried that? Is it faster?
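A minimal sketch of that, with a toy dataset standing in for yours:

```r
# Sketch: skim a data.table directly instead of a tibble/data.frame.
library(skimr)
library(data.table)
dt <- as.data.table(mtcars)  # toy stand-in for a large dataset
skim(dt)
```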
Otherwise, scaling up to that size is a big challenge for the design of skimr, and I'm not exactly sure what we could do. We've both been very stretched for time recently, so a big change (like supporting parallel backends, etc.) might still be a ways away.
`skim` takes about ten minutes on my dataset, a tibble with 150k rows and 1,000 columns. This is a lot of columns, but not unreasonable for raw medical data. For comparison, `summary` does not do as much, but it finishes in 30 seconds.