
Improve performance of transpose_list()

Open · halhen opened this issue 2 years ago

This improves the runtime significantly when loading data with many columns. Reordering the loop nesting, together with a much more efficient binary search for column names, does the trick.

In a real-world example, fetching ~300k rows with ~50 columns from MongoDB, this brings the combined query + load time from ~70 seconds down to ~40. It used to break down as ~10 seconds for the query, ~30 seconds in transpose_list(), and ~30 seconds simplifying columns. transpose_list() now takes <2 seconds.

Microbenchmark with synthetic data on an AMD 5950X, 128GB RAM, Fedora Linux 36, R 4.1.3, jsonlite 1.8.0.9000 (commit 80854359):

> set.seed(1)
> rows <- 10000
> columns <- 100
> p_missing <- 0.2
>
> recordlist <- lapply(1:rows, function(rownum) {
+   row <- as.list(1:columns)
+   names(row) <- paste0("col_", row)
+   row[runif(columns) > p_missing]
+ })
> columns <- unique(unlist(lapply(recordlist, names), recursive = FALSE,
+                          use.names = FALSE))

Before this change

> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 577.8338 589.4064 593.0518 591.6895 599.4221 607.3057    10

With this change

> microbenchmark::microbenchmark(
+     jsonlite:::transpose_list(recordlist, columns),
+     times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 41.37537 43.22655 43.88987 43.76705 45.43552 46.81052    10

halhen · Jul 14 '22 23:07