
Performance of streaming requests

Aariq opened this issue 10 months ago • 17 comments

While refactoring the rnpn package to use httr2, I've discovered that streaming ndjson with req_perform_connection() and resp_stream_lines() takes more time and significantly more memory than using curl + jsonlite::stream_in()—so much so that I'm going to have to revert the change, as users are running up against memory limits. I'm not sure whether this is just overhead from httr2's extra features or something that can be addressed (or possibly I'm doing something wrong!).

For example, a request that uses ~17MB of RAM with curl + jsonlite::stream_in() uses ~1GB of RAM with httr2.

Full benchmark code:

library(httr2)
library(curl)
#> Using libcurl 8.11.1 with OpenSSL/3.3.2
library(jsonlite)
library(bench)

url <- "https://services.usanpn.org/npn_portal//observations/getSummarizedData.ndjson?"
query <- list(request_src = "benchmarking", climate_data = "0", start_date = "2025-01-01", 
              end_date = "2025-12-31")

bench::mark(
  httr2 = {
    req <- httr2::request(url) %>%
      httr2::req_method("POST") %>%
      httr2::req_body_form(!!!query)
    
    con <- httr2::req_perform_connection(req)
    out_httr2 <- tibble::tibble()
    
    while(!httr2::resp_stream_is_complete(con)) {
      resp <- httr2::resp_stream_lines(con, lines = 5000)
      df <- resp %>% 
        textConnection() %>% 
        jsonlite::stream_in(verbose = FALSE, pagesize = 5000)
      out_httr2 <- dplyr::bind_rows(out_httr2, df)
    }
    close(con)
    out_httr2
  },
  
  curl = {
    query2 <- c(query, customrequest = "POST")
    h <- new_handle() %>% handle_setform(.list = query2)
    
    con <- curl(url, handle = h)
    out_curl <- tibble::tibble()
    
    jsonlite::stream_in(con, function(df) {
      #I know this isn't necessary, but in the real code data wrangling happens
      #in the callback function
      out_curl <<- dplyr::bind_rows(out_curl, df) 
    }, verbose = FALSE, pagesize = 5000)
    out_curl
  }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 httr2         14.6s    14.6s    0.0683    1.04GB   1.37  
#> 2 curl          14.6s    14.6s    0.0687   17.79MB   0.0687

Created on 2025-03-13 with reprex v2.1.1

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.3 (2025-02-28)
#>  os       macOS Sequoia 15.3.2
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Phoenix
#>  date     2025-03-13
#>  pandoc   3.6.2 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bench       * 1.1.4   2025-01-16 [1] CRAN (R 4.4.1)
#>  cli           3.6.4   2025-02-13 [1] CRAN (R 4.4.1)
#>  curl        * 6.2.1   2025-02-19 [1] CRAN (R 4.4.1)
#>  digest        0.6.37  2024-08-19 [1] CRAN (R 4.4.1)
#>  dplyr         1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
#>  evaluate      1.0.3   2025-01-10 [1] CRAN (R 4.4.1)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>  fs            1.6.5   2024-10-30 [1] CRAN (R 4.4.1)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
#>  glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>  httr2       * 1.1.1   2025-03-08 [1] CRAN (R 4.4.1)
#>  jsonlite    * 1.8.9   2024-09-20 [1] CRAN (R 4.4.1)
#>  knitr         1.49    2024-11-08 [1] CRAN (R 4.4.1)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>  pillar        1.10.1  2025-01-07 [1] CRAN (R 4.4.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>  profmem       0.6.0   2020-12-13 [1] CRAN (R 4.4.0)
#>  R6            2.6.1   2025-02-15 [1] CRAN (R 4.4.1)
#>  rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.4.0)
#>  reprex        2.1.1   2024-07-06 [1] CRAN (R 4.4.0)
#>  rlang         1.1.5   2025-01-17 [1] CRAN (R 4.4.1)
#>  rmarkdown     2.29    2024-11-04 [1] CRAN (R 4.4.1)
#>  rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>  withr         3.0.2   2024-10-28 [1] CRAN (R 4.4.1)
#>  xfun          0.50    2025-01-07 [1] CRAN (R 4.4.1)
#>  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.0)
#> 
#>  [1] /Users/ericscott/Library/R/x86_64/4.4/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

Aariq • Mar 13 '25

Can you give me a slightly more realistic use case? It doesn't seem like you get any benefit from streaming here, since you download and process every single line.

hadley • Mar 13 '25

Hmmmm, maybe the different meanings of streaming in httr2 and curl are confusing here. I don't think your use case benefits from httr2 streaming.

It is a bit weird that this allocates so much memory though:

library(httr2)
library(curl)
#> Using libcurl 8.11.1 with OpenSSL/3.3.2
library(jsonlite)
library(bench)

url <- "https://services.usanpn.org/npn_portal//observations/getSummarizedData.ndjson?"
query <- list(request_src = "benchmarking", climate_data = "0", start_date = "2025-01-01", 
              end_date = "2025-12-31")

req <- httr2::request(url) %>%
  httr2::req_method("POST") %>%
  httr2::req_body_form(!!!query)

bench::mark(
  httr2 = {
    con <- httr2::req_perform_connection(req)

    while(!httr2::resp_stream_is_complete(con)) {
      resp <- httr2::resp_stream_lines(con, lines = 5000)
    }
    close(con)    
  }
)

hadley • Mar 13 '25

The function I'm refactoring downloads potentially 300,000+ rows, does quite a bit of data wrangling on each chunk, and optionally writes each chunk to a file rather than row-binding it to an in-memory data frame. Now all but the smallest queries seem to be having issues.

Users are now running into what appear to be out-of-memory errors with requests that previously worked without writing to a file.

I didn't realize "streaming" had a different meaning here. I'm just hoping to get the ndjson a chunk at a time so I can wrangle it and optionally write it to a file. There might be a better approach though—I think I could only use streaming if an output file is specified and otherwise just read it all in one go, but I'm not sure that would solve the memory issue.
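
Roughly, the pattern I'm after looks like this (just a sketch; wrangle() and out_path are placeholders for the real package code):

con <- httr2::req_perform_connection(req)
while (!httr2::resp_stream_is_complete(con)) {
  lines <- httr2::resp_stream_lines(con, lines = 5000)
  df <- jsonlite::stream_in(textConnection(lines), verbose = FALSE)
  df <- wrangle(df)  # chunk-wise data wrangling (placeholder)
  # append each wrangled chunk to a file instead of holding everything in memory
  write.table(df, out_path, sep = ",", row.names = FALSE,
              append = file.exists(out_path), col.names = !file.exists(out_path))
}
close(con)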

Aariq • Mar 14 '25

@Aariq thanks for the context, I'll take a look when I'm back from vacation. FWIW I'd highly recommend that you don't do iterative rowbinding as this is likely to be slow and cause a lot of memory allocations.
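
For example, collect the parsed chunks in a list and bind once at the end; something like this (a sketch using the same chunking as your benchmark):

con <- httr2::req_perform_connection(req)
chunks <- list()
while (!httr2::resp_stream_is_complete(con)) {
  lines <- httr2::resp_stream_lines(con, lines = 5000)
  # parse each chunk, but don't bind yet
  chunks[[length(chunks) + 1]] <- jsonlite::stream_in(textConnection(lines), verbose = FALSE)
}
close(con)
out <- dplyr::bind_rows(chunks)  # single bind at the end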

hadley • Mar 14 '25

Some more code to help me understand what's going on:

library(httr2)

url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
  req_body_form(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
system.time(resp <- req_perform(req))
length(strsplit(resp_body_string(resp), "\n")[[1]])

stream_data <- function(req, lines) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  while(!resp_stream_is_complete(con)) {
    resp <- resp_stream_lines(con, lines = lines)
  }

  invisible()
}
batch_data <- function(req) {
  resp <- req_perform(req)
  resp_body_string(resp)
  invisible()
}

bench::mark(
  stream_data(req, 10),
  stream_data(req, 100),
  stream_data(req, 1000),
  batch_data(req),
  iterations = 1,
  filter_gc = FALSE,
  check = FALSE
)[1:5]
#> # A tibble: 4 × 5
#> expression                  min   median `itr/sec` mem_alloc
#> <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data(req, 10)       2.8s     2.8s     0.357      60MB
#> 2 stream_data(req, 100)     2.86s    2.86s     0.349    59.8MB
#> 3 stream_data(req, 1000)    2.82s    2.82s     0.354    61.1MB
#> 4 batch_data(req)           2.56s    2.56s     0.390   423.6KB

So even with a smaller example, I'm seeing a lot more memory churn. The churn doesn't seem to affect the overall speed much and seems independent of the chunk size.

If I do some memory profiling with profvis:

profvis::profvis(stream_data(req, 100))

All of the allocation seems to be happening in readBin(), which frankly surprises me; I wouldn't have thought that would allocate in R at all.

hadley • Mar 24 '25

Ok, if I rewrite this in pure curl, I see the same memory allocation:

library(curl)

stream_data <- function() {
  url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
  body_fields <- c(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
  body <- charToRaw(paste0(paste0(names(body_fields), "=", body_fields), collapse = "&"))

  h <- new_handle()
  handle_setopt(h, post = TRUE, postfieldsize = length(body), postfields = body)
  
  con <- curl(url, handle = h)
  open(con, "rbf", blocking = FALSE)
  on.exit(close(con))
  
  while(isIncomplete(con)) {
    readBin(con, raw(), 10 * 1024)
  }
  
  invisible()
}

bench::mark(stream_data(), iterations = 1, filter_gc = FALSE)[1:5]
#> # A tibble: 1 × 5
#>   expression         min   median `itr/sec` mem_alloc
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data()    2.82s    2.82s     0.354     227MB

hadley • Mar 24 '25

I've forwarded this to @jeroen to take a look at, and since it doesn't appear to be a httr2 issue, I'm going to remove this from the milestone.

hadley • Mar 24 '25

And closing here since it's now tracked in curl.

hadley • Mar 26 '25

@Aariq this should be fixed in curl 6.2.3. You can install the dev version from r-universe:

install.packages("curl", repos = "https://jeroen.r-universe.dev")

jeroen • Apr 01 '25

This is no longer a problem in curl, but it looks like we still have some work to do in httr2:

library(httr2)

url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
  req_body_form(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
# system.time(resp <- req_perform(req))
# length(strsplit(resp_body_string(resp), "\n")[[1]])

stream_data <- function(req) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  while(!resp_stream_is_complete(con)) {
    resp <- resp_stream_lines(con, lines = 100)
  }

  invisible()
}

stream_data_raw <- function(req) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  while(!resp_stream_is_complete(con)) {
    resp <- resp_stream_raw(con, kb = 1)
  }

  invisible()
}

bench::mark(
  stream_data(req),
  stream_data_raw(req),
  iterations = 1,
  filter_gc = FALSE,
  check = FALSE
)[1:5]
#> # A tibble: 2 × 5
#>   expression                min   median `itr/sec` mem_alloc
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data(req)        5.17s    5.17s     0.194     139MB
#> 2 stream_data_raw(req)    5.51s    5.51s     0.181     975KB

Created on 2025-04-01 with reprex v2.1.1

hadley • Apr 01 '25

Fixing the memory allocations is going to require a couple of hours of work. I'll first need to create a ring buffer implementation so that we can retrieve and use raw bytes from the connection without allocating memory. Then I'll need to rewrite the event boundary functions to work with some sort of callback function on the ring buffer.
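
Roughly the idea, as a sketch (not the eventual httr2 implementation): a preallocated raw vector plus read/write positions, so bytes can be pushed and popped without growing a new vector for every chunk.

ring_new <- function(size) {
  ring <- new.env(parent = emptyenv())
  ring$buf <- raw(size)  # preallocated storage
  ring$head <- 0L        # next write position (0-based)
  ring$tail <- 0L        # next read position (0-based)
  ring$n <- 0L           # bytes currently stored
  ring
}

ring_push <- function(ring, bytes) {
  stopifnot(length(bytes) <= length(ring$buf) - ring$n)
  idx <- (ring$head + seq_along(bytes) - 1L) %% length(ring$buf) + 1L
  ring$buf[idx] <- bytes
  ring$head <- (ring$head + length(bytes)) %% length(ring$buf)
  ring$n <- ring$n + length(bytes)
  invisible(ring)
}

ring_pop <- function(ring, n) {
  n <- min(as.integer(n), ring$n)
  idx <- (ring$tail + seq_len(n) - 1L) %% length(ring$buf) + 1L
  out <- ring$buf[idx]
  ring$tail <- (ring$tail + n) %% length(ring$buf)
  ring$n <- ring$n - n
  out
}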

hadley • Apr 01 '25

With ~4 hours of work:

  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 stream_data(req)        6.83s    6.83s     0.147   43.03MB
2 stream_data_raw(req)    5.38s    5.38s     0.186    1.28MB

I thought it would be a bigger difference 😬 Also it's way slower than the previous code 😞

hadley • Apr 01 '25

Hi. Any update on this one, maybe?

Seb-FS-Axpo • Apr 22 '25

@Seb-FS-Axpo did you try with the new curl as suggested in https://github.com/r-lib/httr2/issues/704#issuecomment-2768712657

jeroen • Apr 22 '25

Hi @jeroen, thanks for the quick feedback. I was hoping to keep using the httr2 implementation, and was hoping for more feedback based on https://github.com/r-lib/httr2/issues/704#issuecomment-2770711117

Seb-FS-Axpo • Apr 22 '25

@Seb-FS-Axpo httr2 is based on curl. If you upgrade curl, the problem will be fixed in httr2 too.

jeroen • Apr 22 '25

@jeroen I think there's still work to do in httr2, since we also do buffering that seems to be creating a bunch of copies.
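
For example, if each read appends to a growing raw vector, every append copies everything accumulated so far; an illustration of the kind of copying I mean, not httr2's actual internals:

con <- rawConnection(raw(1e6))  # stand-in for a streaming connection
buffer <- raw()
while (length(chunk <- readBin(con, raw(), 32 * 1024)) > 0) {
  buffer <- c(buffer, chunk)  # allocates and copies the whole buffer each time
}
close(con)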

hadley • Apr 22 '25