Performance of streaming requests
While refactoring the rnpn package to use httr2, I've discovered that streaming ndjson with req_perform_connection() and resp_stream_lines() takes more time and significantly more memory than using curl + jsonlite::stream_in(), so much so that I'm going to have to revert the change because users are running up against memory limits. I'm not sure whether this is just overhead from httr2's extra features or something that can be addressed (or possibly I'm doing something wrong!).
For example, a request that uses ~17MB of RAM with curl + jsonlite::stream_in() uses ~1GB of RAM with httr2.
Full benchmark code:
library(httr2)
library(curl)
#> Using libcurl 8.11.1 with OpenSSL/3.3.2
library(jsonlite)
library(bench)
url <- "https://services.usanpn.org/npn_portal//observations/getSummarizedData.ndjson?"
query <- list(request_src = "benchmarking", climate_data = "0", start_date = "2025-01-01",
end_date = "2025-12-31")
bench::mark(
httr2 = {
req <- httr2::request(url) %>%
httr2::req_method("POST") %>%
httr2::req_body_form(!!!query)
con <- httr2::req_perform_connection(req)
out_httr2 <- tibble::tibble()
while(!httr2::resp_stream_is_complete(con)) {
resp <- httr2::resp_stream_lines(con, lines = 5000)
df <- resp %>%
textConnection() %>%
jsonlite::stream_in(verbose = FALSE, pagesize = 5000)
out_httr2 <- dplyr::bind_rows(out_httr2, df)
}
close(con)
out_httr2
},
curl = {
query2 <- c(query, customrequest = "POST")
h <- new_handle() %>% handle_setform(.list = query2)
con <- curl(url, handle = h)
out_curl <- tibble::tibble()
jsonlite::stream_in(con, function(df) {
#I know this isn't necessary, but in the real code data wrangling happens
#in the callback function
out_curl <<- dplyr::bind_rows(out_curl, df)
}, verbose = FALSE, pagesize = 5000)
out_curl
}
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 httr2 14.6s 14.6s 0.0683 1.04GB 1.37
#> 2 curl 14.6s 14.6s 0.0687 17.79MB 0.0687
Created on 2025-03-13 with reprex v2.1.1
Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.4.3 (2025-02-28)
#> os macOS Sequoia 15.3.2
#> system x86_64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz America/Phoenix
#> date 2025-03-13
#> pandoc 3.6.2 @ /usr/local/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> bench * 1.1.4 2025-01-16 [1] CRAN (R 4.4.1)
#> cli 3.6.4 2025-02-13 [1] CRAN (R 4.4.1)
#> curl * 6.2.1 2025-02-19 [1] CRAN (R 4.4.1)
#> digest 0.6.37 2024-08-19 [1] CRAN (R 4.4.1)
#> dplyr 1.1.4 2023-11-17 [1] CRAN (R 4.4.0)
#> evaluate 1.0.3 2025-01-10 [1] CRAN (R 4.4.1)
#> fastmap 1.2.0 2024-05-15 [1] CRAN (R 4.4.0)
#> fs 1.6.5 2024-10-30 [1] CRAN (R 4.4.1)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.4.0)
#> glue 1.8.0 2024-09-30 [1] CRAN (R 4.4.1)
#> htmltools 0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#> httr2 * 1.1.1 2025-03-08 [1] CRAN (R 4.4.1)
#> jsonlite * 1.8.9 2024-09-20 [1] CRAN (R 4.4.1)
#> knitr 1.49 2024-11-08 [1] CRAN (R 4.4.1)
#> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.4.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.4.0)
#> pillar 1.10.1 2025-01-07 [1] CRAN (R 4.4.1)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.4.0)
#> profmem 0.6.0 2020-12-13 [1] CRAN (R 4.4.0)
#> R6 2.6.1 2025-02-15 [1] CRAN (R 4.4.1)
#> rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.4.0)
#> reprex 2.1.1 2024-07-06 [1] CRAN (R 4.4.0)
#> rlang 1.1.5 2025-01-17 [1] CRAN (R 4.4.1)
#> rmarkdown 2.29 2024-11-04 [1] CRAN (R 4.4.1)
#> rstudioapi 0.17.1 2024-10-22 [1] CRAN (R 4.4.1)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.4.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.4.0)
#> tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.4.0)
#> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.4.0)
#> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.4.0)
#> withr 3.0.2 2024-10-28 [1] CRAN (R 4.4.1)
#> xfun 0.50 2025-01-07 [1] CRAN (R 4.4.1)
#> yaml 2.3.10 2024-07-26 [1] CRAN (R 4.4.0)
#>
#> [1] /Users/ericscott/Library/R/x86_64/4.4/library
#> [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Can you give me a bit more of a realistic use case? It doesn't seem like you get any benefit from streaming here, since you download and process every single line.
Hmmmm, maybe the different meanings of streaming in httr2 and curl are confusing here. I don't think your use case benefits from httr2 streaming.
It is a bit weird that this allocates so much memory though:
library(httr2)
library(curl)
#> Using libcurl 8.11.1 with OpenSSL/3.3.2
library(jsonlite)
library(bench)
url <- "https://services.usanpn.org/npn_portal//observations/getSummarizedData.ndjson?"
query <- list(request_src = "benchmarking", climate_data = "0", start_date = "2025-01-01",
end_date = "2025-12-31")
req <- httr2::request(url) %>%
httr2::req_method("POST") %>%
httr2::req_body_form(!!!query)
bench::mark(
httr2 = {
con <- httr2::req_perform_connection(req)
while(!httr2::resp_stream_is_complete(con)) {
resp <- httr2::resp_stream_lines(con, lines = 5000)
}
close(con)
}
)
The function I'm refactoring downloads potentially 300,000+ rows, does quite a bit of data wrangling on each chunk, and optionally writes each chunk to a file rather than rowbinding it into an in-memory data frame. Now all but the smallest queries seem to be having issues.
Users are now running into errors that seem to be due to running out of memory with requests that previously worked without writing to a file.
I didn't realize "streaming" had a different meaning here. I'm just hoping to get the ndjson a chunk at a time so I can wrangle it and optionally write it to a file. There might be a better approach, though: I could use streaming only when an output file is specified and otherwise read it all in one go, but I'm not sure that would solve the memory issue.
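For context, this is roughly the pattern I'm aiming for (a simplified sketch, not the actual rnpn code; wrangle() and path are hypothetical placeholders):

library(httr2)

# Simplified sketch: stream the ndjson in chunks, wrangle each chunk, and
# append it to a file instead of rowbinding in memory.
# `wrangle()` and `path` are hypothetical placeholders.
stream_to_file <- function(req, path, wrangle = identity, lines = 5000) {
  con <- req_perform_connection(req)
  on.exit(close(con))
  first <- TRUE
  while (!resp_stream_is_complete(con)) {
    chunk <- resp_stream_lines(con, lines = lines)
    if (length(chunk) == 0) next
    df <- jsonlite::stream_in(textConnection(chunk), verbose = FALSE)
    df <- wrangle(df)
    # Append each wrangled chunk to the file as it arrives
    utils::write.table(df, path, append = !first, col.names = first,
                       sep = ",", row.names = FALSE)
    first <- FALSE
  }
  invisible(path)
}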
@Aariq thanks for the context, I'll take a look when I'm back from vacation. FWIW I'd highly recommend that you don't do iterative rowbinding as this is likely to be slow and cause a lot of memory allocations.
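Something like this (an untested sketch, assuming con is the open connection from req_perform_connection()) collects each chunk in a list and binds once at the end:

# Untested sketch: accumulate chunks in a list and bind once at the end,
# rather than growing a data frame inside the loop.
chunks <- list()
i <- 1
while (!httr2::resp_stream_is_complete(con)) {
  lines <- httr2::resp_stream_lines(con, lines = 5000)
  chunks[[i]] <- jsonlite::stream_in(textConnection(lines), verbose = FALSE)
  i <- i + 1
}
out <- dplyr::bind_rows(chunks)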
Some more code to help me understand what's going on:
library(httr2)
url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
req_body_form(
request_src = "benchmarking",
climate_data = "0",
start_date = "2025-01-01",
end_date = "2025-12-01",
state = "TX"
)
system.time(resp <- req_perform(req))
length(strsplit(resp_body_string(resp), "\n")[[1]])
stream_data <- function(req, lines) {
con <- req_perform_connection(req)
on.exit(close(con))
while(!resp_stream_is_complete(con)) {
resp <- resp_stream_lines(con, lines = lines)
}
invisible()
}
batch_data <- function(req) {
resp <- req_perform(req)
resp_body_string(resp)
invisible()
}
bench::mark(
stream_data(req, 10),
stream_data(req, 100),
stream_data(req, 1000),
batch_data(req),
iterations = 1,
filter_gc = FALSE,
check = FALSE
)[1:5]
#> # A tibble: 4 × 5
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 stream_data(req, 10) 2.8s 2.8s 0.357 60MB
#> 2 stream_data(req, 100) 2.86s 2.86s 0.349 59.8MB
#> 3 stream_data(req, 1000) 2.82s 2.82s 0.354 61.1MB
#> 4 batch_data(req) 2.56s 2.56s 0.390 423.6KB
So even with this smaller example, there's a lot more memory churn going on with streaming. The churn doesn't seem to affect the overall speed much, and it appears to be independent of the chunk size.
If I do some memory profiling with profvis:
profvis::profvis(stream_data(req, 100))
All of the allocation seems to be happening in readBin(), which frankly surprises me because I wouldn't have thought it would allocate in R at all.
Ok, if I rewrite this in pure curl, I see the same memory allocation:
library(curl)
stream_data <- function() {
url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
body_fields <- c(
request_src = "benchmarking",
climate_data = "0",
start_date = "2025-01-01",
end_date = "2025-12-01",
state = "TX"
)
body <- charToRaw(paste0(paste0(names(body_fields), "=", body_fields), collapse = "&"))
h <- new_handle()
handle_setopt(h, post = TRUE, postfieldsize = length(body), postfields = body)
con <- curl(url, handle = h)
open(con, "rbf", blocking = FALSE)
on.exit(close(con))
while(isIncomplete(con)) {
readBin(con, raw(), 10 * 1024)
}
invisible()
}
bench::mark(stream_data(), iterations = 1, filter_gc = FALSE)[1:5]
#> # A tibble: 1 × 5
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 stream_data() 2.82s 2.82s 0.354 227MB
I've forwarded this to @jeroen to take a look at, and since it doesn't appear to be an httr2 issue, I'm going to remove this from the milestone.
And closing here since it's now tracked in curl.
@Aariq this should be fixed in curl 6.2.3. You can install the dev version from r-universe:
install.packages("curl", repos = "https://jeroen.r-universe.dev")
This is no longer a problem in curl, but it looks like we still have some work to do in httr2:
library(httr2)
url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
req_body_form(
request_src = "benchmarking",
climate_data = "0",
start_date = "2025-01-01",
end_date = "2025-12-01",
state = "TX"
)
# system.time(resp <- req_perform(req))
# length(strsplit(resp_body_string(resp), "\n")[[1]])
stream_data <- function(req, lines = 100) {
con <- req_perform_connection(req)
on.exit(close(con))
while(!resp_stream_is_complete(con)) {
resp <- resp_stream_lines(con, lines = lines)
}
invisible()
}
stream_data_raw <- function(req) {
con <- req_perform_connection(req)
on.exit(close(con))
while(!resp_stream_is_complete(con)) {
resp <- resp_stream_raw(con, kb = 1)
}
invisible()
}
bench::mark(
stream_data(req),
stream_data_raw(req),
iterations = 1,
filter_gc = FALSE,
check = FALSE
)[1:5]
#> # A tibble: 2 × 5
#> expression min median `itr/sec` mem_alloc
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
#> 1 stream_data(req) 5.17s 5.17s 0.194 139MB
#> 2 stream_data_raw(req) 5.51s 5.51s 0.181 975KB
Created on 2025-04-01 with reprex v2.1.1
Fixing the memory allocations is going to require a couple of hours of work. I'll first need to create a ring buffer implementation so that we can retrieve and use raw bytes from the connection without allocating memory. Then I'll need to rewrite the event boundary functions to work with some sort of callback on the ring buffer.
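To illustrate the idea (a simplified sketch only, not the actual httr2 implementation): a ring buffer keeps a fixed-size raw vector plus head/tail positions, so incoming bytes can be pushed in and complete lines popped out without the buffer ever growing.

# Simplified sketch of a ring buffer (illustrative only): a fixed-size raw
# vector with head/tail bookkeeping, so the buffer itself never grows.
new_ring_buffer <- function(size = 64 * 1024) {
  buf <- raw(size)
  head <- 0L  # next write position (0-based)
  tail <- 0L  # next read position (0-based)

  # Append bytes; assumes the buffer never fills completely
  # (head == tail is treated as "empty").
  push <- function(bytes) {
    n <- length(bytes)
    idx <- ((head + seq_len(n) - 1L) %% size) + 1L
    buf[idx] <<- bytes
    head <<- (head + n) %% size
    invisible(n)
  }

  # Pop the bytes up to and including the first "\n", or NULL if no
  # complete line is buffered yet.
  pop_line <- function() {
    avail <- (head - tail) %% size
    if (avail == 0L) return(NULL)
    idx <- ((tail + seq_len(avail) - 1L) %% size) + 1L
    nl <- which(buf[idx] == as.raw(0x0a))[1]
    if (is.na(nl)) return(NULL)
    out <- buf[idx[seq_len(nl)]]
    tail <<- (tail + nl) %% size
    out
  }

  list(push = push, pop_line = pop_line)
}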
With ~4 hours of work:
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 stream_data(req) 6.83s 6.83s 0.147 43.03MB
2 stream_data_raw(req) 5.38s 5.38s 0.186 1.28MB
I thought it would be a bigger difference 😬 Also it's way slower than the previous code 😞
Hi. Any update on this one?
@Seb-FS-Axpo did you try with the new curl as suggested in https://github.com/r-lib/httr2/issues/704#issuecomment-2768712657
Hi @jeroen, thanks for the quick feedback. I was hoping to keep using the httr2 implementation, and was hoping for more feedback based on https://github.com/r-lib/httr2/issues/704#issuecomment-2770711117
@Seb-FS-Axpo httr2 is based on curl. If you upgrade curl, the problem will be fixed in httr2 too.
@jeroen I think there's still work to do in httr2, since we also do buffering that seems to be creating a bunch of copies.
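The problem with copy-heavy buffering is roughly this (purely illustrative, not httr2's actual code, and assuming con is an open connection):

# Purely illustrative: growing a raw buffer with c() re-copies everything
# read so far on every iteration, so allocations grow quadratically with
# the size of the response.
buf <- raw()
while (length(chunk <- readBin(con, raw(), 32 * 1024)) > 0) {
  buf <- c(buf, chunk)  # copies all of `buf` each time
}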