Improve performance of transpose_list()
This significantly improves the runtime when loading data with many columns. Changing the order of loop nesting, together with a much more efficient binary search over the column names, does the trick (a sketch of the idea follows below).
In a real-world example, fetching ~300k rows with ~50 columns from MongoDB, this brings the query + load time down from ~70 seconds to ~40 seconds. It used to be ~10 seconds for the query, ~30 seconds in transpose_list, and ~30 seconds simplifying the columns; transpose_list now takes <2 seconds.
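To make the idea concrete, here is a minimal R sketch of the approach. This is not jsonlite's actual implementation, and transpose_sketch() / bsearch() are illustrative names: the outer loop walks the records once, and each field name is located in the sorted column vector with a binary search, instead of looping over columns and scanning every record for each of them.

# Minimal R sketch of the approach; not jsonlite's actual implementation.
bsearch <- function(key, sorted) {
  # classic binary search over a sorted character vector; returns 0 if absent
  lo <- 1L
  hi <- length(sorted)
  while (lo <= hi) {
    mid <- (lo + hi) %/% 2L
    if (sorted[mid] < key) {
      lo <- mid + 1L
    } else if (sorted[mid] > key) {
      hi <- mid - 1L
    } else {
      return(mid)
    }
  }
  0L
}

transpose_sketch <- function(recordlist, columns) {
  ord <- order(columns)        # sort the column names once ...
  sorted <- columns[ord]       # ... so every lookup can be a binary search
  # pre-allocate one NULL-filled column per name
  out <- rep(list(vector("list", length(recordlist))), length(columns))
  names(out) <- columns
  for (i in seq_along(recordlist)) {    # outer loop: one pass over the records
    row <- recordlist[[i]]
    keys <- names(row)
    for (j in seq_along(row)) {         # inner loop: only the fields present
      pos <- bsearch(keys[j], sorted)
      if (pos > 0L) out[[ord[pos]]][[i]] <- row[[j]]
    }
  }
  out
}

Sorting the column names once and binary-searching per field makes each name lookup logarithmic in the number of columns instead of a linear scan, and visiting every record exactly once avoids rescanning the whole record list for each column.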
Microbenchmark with synthetic data on an AMD Ryzen 9 5950X, 128 GB RAM, Fedora Linux 36, R 4.1.3, jsonlite 1.8.0.9000 (commit 80854359):
> set.seed(1)
> rows <- 10000
> columns <- 100
> p_missing <- 0.2
>
> recordlist <- lapply(1:rows, function(rownum) {
+   row <- as.list(1:columns)
+   names(row) <- paste0("col_", row)
+   row[runif(columns) > p_missing]
+ })
> columns <- unique(unlist(lapply(recordlist, names), recursive = FALSE,
+                          use.names = FALSE))
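As a quick sanity check on the synthetic data (not part of the original benchmark script): each field is kept with probability 1 - p_missing = 0.8, so on average about 80 of the 100 possible columns are present per record, and all 100 column names should appear across the 10,000 records.

mean(lengths(recordlist))   # average number of fields per record; should be close to 80
length(columns)             # distinct column names observed; should be 100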
Before this change
> microbenchmark::microbenchmark(
+ jsonlite:::transpose_list(recordlist, columns),
+ times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 577.8338 589.4064 593.0518 591.6895 599.4221 607.3057    10
With this change
> microbenchmark::microbenchmark(
+ jsonlite:::transpose_list(recordlist, columns),
+ times = 10
+ )
Unit: milliseconds
                                           expr      min       lq     mean   median       uq      max neval
 jsonlite:::transpose_list(recordlist, columns) 41.37537 43.22655 43.88987 43.76705 45.43552 46.81052    10
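On this synthetic benchmark the mean time drops from about 593 ms to about 44 ms, roughly a 13x speedup.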