Improve performance (for big datasets)
First of all, thanks for making such a useful package!

However, I have been trying to `skim` a dataset with lots of columns (thousands) and noticed very poor performance. The full dataset, with >20,000 columns, ran overnight without completing, so the slowdown appears to be much worse than linear in the number of columns. I found similar issues being discussed in #370 and elsewhere.

I had a look to see where the bottleneck was: the `skim_by_type` methods and the `build_results` function they call are both very slow when there are lots of columns. ~~This looks to me like an issue with `dplyr::across` when running `summarize` with lots of column/function pairs. Notwithstanding that,~~ I refactored `skim_by_type` to improve performance by a factor of 25 for a 100,000 x 1,500 dataset. The larger dataset I am working with (which previously did not complete overnight) now runs in ~1 minute. I may be missing something here, so apologies in advance if so.

I am happy to make a branch to demonstrate this properly and open it up for improvement, but for now see below for a reproducible example showing the relative performance of the refactored `skim_by_type` function, which should be able to replace all three of the current methods:
```r
library(tidyverse)
library(skimr)
library(stringi)
library(microbenchmark)

#### Make some wide test data
number_rows <- 100000
coltypes <- c(rep("numeric", 500), rep("date", 500), rep("char", 500))

large_test_df <- bind_cols(
  map(seq_along(coltypes), function(col) {
    if (coltypes[col] == "numeric") content <- round(runif(number_rows, 0, col), sample(c(0, 1, 2, 3), 1))
    if (coltypes[col] == "date") content <- sample(seq(as.Date("1950-01-01"), Sys.Date(), by = "day"), number_rows, replace = TRUE)
    if (coltypes[col] == "char") content <- sample(stri_rand_strings(500, sample(c(1, 3, 5), 1)), number_rows, replace = TRUE)
    tibble(!!sym(str_c("col", col)) := content)
  })
)
```
```r
#### Save the original function and define the refactored function
skim_by_type_original <- skimr:::skim_by_type

skim_by_type_refactor <- function(mangled_skimmers, variable_names, data) {
  group_columns <- groups(data)
  data <- as_tibble(data)
  # Pivot one variable at a time to long format, then compute all skimmers
  # in a single summarize() call per variable.
  map_df(variable_names, function(skim_var) {
    select(data, !!skim_var, !!!group_columns) %>%
      pivot_longer(cols = all_of(skim_var), names_to = "skim_variable", values_to = "values") %>%
      group_by(skim_variable, !!!group_columns) %>%
      summarize(across(values, mangled_skimmers$funs), .groups = "drop")
  }) %>%
    # Drop the mangled "values_" prefix that across() adds to the result names
    rename_with(~ str_remove(.x, "values_~!@#\\$%\\^\\&\\*\\(\\)-\\+"), everything()) %>%
    # Restore the original variable order
    arrange(match(skim_variable, variable_names))
}
```
```r
#### Run the test
microbenchmark(
  {
    assignInNamespace("skim_by_type", skim_by_type_original, ns = "skimr")
    full_version <- skim(large_test_df)
  },
  {
    assignInNamespace("skim_by_type", skim_by_type_refactor, ns = "skimr")
    full_version <- skim(large_test_df)
  },
  times = 5
)
```
```
Unit: seconds
                                                                                                                  expr
 { assignInNamespace("skim_by_type", skim_by_type_original, ns = "skimr") full_version <- skimr::skim(large_test_df) }
 { assignInNamespace("skim_by_type", skim_by_type_refactor, ns = "skimr") full_version <- skimr::skim(large_test_df) }
      min        lq      mean    median        uq       max neval
 251.1866 256.76540 256.68652 256.94793 258.26105 260.27161     5
  10.1157  10.18126  10.30293  10.28105  10.35256  10.58411     5
```
Wow, yes, we have really been trying to think about the bottlenecks. We have known that the processing was very slow, so this is really helpful! @michaelquinn32 thoughts?

In the original code we have `skim_by_type.grouped_df()` and `skim_by_type.data.frame()` (among others). But I don't think your code is using groups? Are you suggesting a change to several functions, or that we drop the differentiation?
Hey @elinw, yes, I saw that in the original code `skim_by_type` is split into three (?) methods to handle `grouped_df`, `data.table`, and plain `data.frame`s. The single function above should handle each of these classes without issue, but of course it could be split into different methods to handle each class separately; it just seemed easier to demonstrate it as above. If you are interested, I'd be happy to formalise things in a new branch when I find the time, but I don't have permissions to do so at the moment.
Well, to create a new branch you would fork the project and create the branch in your own repo. We use the S3 approach because, among other things, it lets people develop skim methods for different kinds of objects. Now that you have raised this, I think we are really at the point in skimr's maturity where working on performance (and, relatedly, pushing into a database when appropriate) is the biggest thing we can do.
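For what it's worth, here is a hypothetical sketch of how the S3 split could be kept while sharing one implementation; `skim_by_type_impl` is an assumed helper name, not part of skimr's actual internals:

```r
# Hypothetical sketch: keep S3 dispatch but delegate to one shared worker.
skim_by_type_impl <- function(mangled_skimmers, variable_names, data) {
  # shared implementation would go here (e.g. the pivot-based version above)
}

skim_by_type <- function(mangled_skimmers, variable_names, data) {
  UseMethod("skim_by_type", data)  # dispatch on the data argument
}

skim_by_type.grouped_df <- function(mangled_skimmers, variable_names, data) {
  skim_by_type_impl(mangled_skimmers, variable_names, data)
}

skim_by_type.data.frame <- function(mangled_skimmers, variable_names, data) {
  skim_by_type_impl(mangled_skimmers, variable_names, data)
}
```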
That's awesome, Henry. Thanks!
We would love a pull request. I would be happy to review your code.
As @elinw pointed out, create a branch with your change and we can go from there.
- If you haven't yet, take a look at our contributor's guide
- Code should pass a run of `lintr` with the default settings
- And you should run `styler::style_pkg()` before we merge
I'm surprised that `dplyr::across()` performs so much worse than pivoting the data first, but I'm thrilled to have discovered this opportunity for improving the user experience.

Thanks!
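For anyone who wants to isolate that comparison, a minimal sketch (the sizes are arbitrary) might benchmark one wide `summarize(across(...))` call against the per-variable pivot approach:

```r
# Sketch: one wide summarize(across(...)) call vs. pivoting each variable
# to long format and summarizing per variable. Sizes are arbitrary.
library(dplyr)
library(tidyr)
library(purrr)
library(microbenchmark)

wide <- as_tibble(setNames(replicate(300, runif(1e4), simplify = FALSE),
                           paste0("col", 1:300)))

microbenchmark(
  across_once = summarize(wide, across(everything(), list(mean = mean, sd = sd))),
  pivot_per_variable = map_df(names(wide), function(v) {
    wide %>%
      select(all_of(v)) %>%
      pivot_longer(everything(), names_to = "skim_variable", values_to = "values") %>%
      group_by(skim_variable) %>%
      summarize(across(values, list(mean = mean, sd = sd)), .groups = "drop")
  }),
  times = 3
)
```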
Hey @michaelquinn32,

I had a chance to look at this properly rather than just rewriting the function that was going slow.

First off, I should clarify that it is not `dplyr::across` that is causing issues. I am afraid to even think what the punishment is for people who bring `dplyr` into disrepute.

Amazingly, it is the call to `as.data.frame` buried inside `skimr:::reshape_skimmed` that causes the majority of the slowness (see here). My explanation is that `as.data.frame` gets slower as the object being converted gets larger. Combined with the fact that `reshape_skimmed` is called once per variable, this leads to worse-than-linear performance degradation as you add columns.
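As a rough illustration of the coercion cost on its own, one could compare the two coercions directly on a wide tibble (the column count below is arbitrary; if the explanation above is right, the gap should grow with width):

```r
# Sketch: compare as.data.frame() and as_tibble() on a wide tibble.
library(tibble)
library(microbenchmark)

wide <- as_tibble(setNames(replicate(2000, runif(100), simplify = FALSE),
                           paste0("col", 1:2000)))

microbenchmark(
  base_data_frame = as.data.frame(wide),
  tibble_coercion = as_tibble(wide),
  times = 20
)
```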
Anyway, I have made an initial PR that simply changes `as.data.frame` to `as_tibble`. What I suggested above in the thread is faster and quite a lot cleaner (in my opinion), so I can make a subsequent PR to rewrite the methods; this initial change is just very low-hanging fruit.
Let's look at the timing for both grouped and ungrouped data with the `as_tibble` change. I'm wondering what we were thinking when we used `as.data.frame()`; was there some kind of edge case? It's right in the middle of a bunch of tidyverse code. This is the commit that essentially added the `as.data.frame()`: https://github.com/ropensci/skimr/commit/81b7ddd8011be8565371e0d4d84e5f2bc46ed15e

Should we be coercing to a tibble after we check that the input to `skim` is a data frame or coercible to a data frame?
Even when we merge that change, let's keep this issue open for the general topic of making skimr faster in different situations.

Also, let's consider performance without histograms as the baseline, since we know they have a big impact.
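For reference, individual skimmers can already be dropped with `skim_with()`; a minimal sketch that builds a `skim()` variant without the numeric histogram:

```r
# Sketch: a skim() variant that skips the inline histogram for numeric columns.
library(skimr)
skim_no_hist <- skim_with(numeric = sfl(hist = NULL))
skim_no_hist(iris)
```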
Hi, I love skimr and use it a lot. Are there any updates on the performance issue? My datasets have 50-300+ variables and 8-60 million observations, and skimr takes a long time to do summaries. Thank you!
Thanks for using skimr!
We do support summaries of data.table objects. Have you tried that? Is it faster?
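A minimal sketch of that, with a toy dataset standing in for yours:

```r
# Sketch: skim a data.table directly instead of a tibble/data.frame.
library(skimr)
library(data.table)
dt <- as.data.table(mtcars)  # toy stand-in for a large dataset
skim(dt)
```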
Otherwise, scaling up to that size is a big challenge for the design of skimr, and I'm not exactly sure what we could do. We've both been very stretched for time recently, so a big change (like supporting parallel backends, etc.) might still be a ways away.
`skim` takes about ten minutes on my dataset, a tibble with 150k rows and 1,000 columns. This is a lot of columns, but not unreasonable for raw medical data. For comparison, `summary` does not do as much, but it finishes in 30 seconds.