frictionless-r icon indicating copy to clipboard operation
frictionless-r copied to clipboard

Replace httr with httr2

Open damianooldoni opened this issue 1 year ago • 9 comments

The httr package is a dependency of frictionless: https://github.com/frictionlessdata/frictionless-r/blob/b0a9fe34b26fe120309899503bba67cffcb710ff/DESCRIPTION#L33.

However, httr is superseded. As this should not create a big issue, for long term development it could be better to switch to httr2. From doc:

httr2 is a ground-up rewrite of httr that provides a pipeable API with an explicit request object that solves more problems felt by packages that wrap APIs (e.g. built-in rate-limiting, retries, OAuth, secure secrets, and more).

damianooldoni avatar Jun 19 '23 10:06 damianooldoni

We only use httr::http_error() to check if something can be found at a url (path):

https://github.com/frictionlessdata/frictionless-r/blob/772862a595a1861eba91a64d1db7952ac63c7e7e/R/utils.R#L107-L110

I don't find a straightforward replacement for that in httr2. We could also try with the underlying curl package ...

peterdesmet avatar Jun 19 '23 16:06 peterdesmet

In RCurl there's a function that has the same behaviour, but RCurl is not advised by rOpenSci:

library(RCurl)
url.exists("https://example.com/404")
#> [1] FALSE

Created on 2023-06-20 with reprex v2.0.2

peterdesmet avatar Jun 20 '23 07:06 peterdesmet

I think httr2 is still an option because it handles errors way more nicely van httr. See example in https://httr2.r-lib.org/articles/wrapping-apis.html#errors which I think could be useful in our case.

damianooldoni avatar Jun 20 '23 08:06 damianooldoni

There are a number of options, a small demonstration:

# setup
library(tidyverse)
bad_url <- "https://docs.ropensci.org/frictionless/not_a_valid_page"

# 1. standard behaviour of httr2 is to return an error on http errors

httr2::request(bad_url) %>% httr2::req_perform()
#> Error in `httr2::req_perform()`:
#> ! HTTP 404 Not Found.

# 2. But you can overwrite this behaviour and pass your own additional error
# information, perhaps returned from the response itself

httr2::request(bad_url) %>%
  httr2::req_error(
    body = ~ glue::glue("Can't find file at `{path}`.",
      path = bad_url
    )
  ) %>%
  httr2::req_perform()
#> Error in `httr2::req_perform()`:
#> ! HTTP 404 Not Found.
#> • Can't find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`.


# 3. You don't need assertthat to do this in httr either:

httr::stop_for_status(
  httr::HEAD(bad_url),
  task = glue::glue("find file at `{path}`.",
    path = bad_url
  )
)
#> Error: Not Found (HTTP 404). Failed to find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`..


# 4. or we can suppress this behaviour completely
assertthat::assert_that(
  !httr2::resp_is_error(
    httr2::request(bad_url) %>%
      httr2::req_error(
        is_error = function(response) FALSE
      ) %>%
      httr2::req_perform()
  ),
  msg = glue::glue("Can't find file at `{path}`.",
    path = bad_url
  )
)
#> Error: Can't find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`.

## 5. or without assertthat

httr2::request(bad_url) %>%
  httr2::req_error(is_error = function(response) FALSE) %>%
  httr2::req_perform() %>%
  httr2::resp_check_status(
    info =
      glue::glue("Can't find file at `{path}`.",
        path = bad_url
      )
  )
#> Error:
#> ! HTTP 404 Not Found.
#> • Can't find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`.

Do you think it's beneficial to return the http error itself?

PietrH avatar Jun 20 '23 09:06 PietrH

The implementation would be:

if (is_url(path)) {
    response <- httr2::req_perform(
      httr2::req_error(httr2::request(path), is_error = function(resp) FALSE)
    ) # req_error() is used to silence R error
    assertthat::assert_that(
      !httr2::resp_is_error(response),
      msg = glue::glue("Can't find file at `{path}`.")
    )

So that is four httr2 functions rather than one httr function. And httr2 has more dependencies. I would therefore not move to httr2.

If someone can replicate this behaviour in underlying curl that we could use that approach.

peterdesmet avatar Jun 20 '23 12:06 peterdesmet

It's pretty doable as long as you assume all HTTP status codes under 300 are good, and all above are bad. It's up to us how complex we want to make this.

There you go, I've also snuk in an example of using purrr adverbs in the tests:

# stopping on http error status using curl

library(testthat)

# setting urls to test ----------------------------------------------------


urls_to_test <- list(
  bad_frictionless = "https://docs.ropensci.org/frictionless/not_a_valid_page",
  good_curl = "https://jeroen.r-universe.dev/curl",
  bad_httpbin_404 = "http://httpbin.org/status/404",
  good_httpbin_201 = "http://httpbin.org/status/201",
  bad_httpbin_504 = "http://httpbin.org/status/504",
  bad_httpbin_429 = "http://httpbin.org/status/429",
  good_httpbin_200 = "http://httpbin.org/status/200"
)


# define function to stop on status ---------------------------------------


stop_on_status <- function(url) {
  assertthat::assert_that(
    curl::curl_fetch_memory(url)$status_code < 300,
    msg = glue::glue("Can't find file at `{path}`.",
      path = url
    )
  )
}


# test function, continue on error ----------------------------------------

test_result <-
  purrr::map(urls_to_test, purrr::safely(stop_on_status)) %>%
  purrr::set_names(names(urls_to_test))

good_url_results <-
  test_result[stringr::str_starts(names(test_result), "good")]

bad_url_results <-
  test_result[stringr::str_starts(names(test_result), "bad")]

test_that("all good urls return TRUE", {
  expect_true(
    all(
      purrr::map_lgl(
        good_url_results,
        ~ rlang::is_true(
          purrr::pluck(.x, "result")
        )
      )
    )
  )
})
#> Test passed 🥳

test_that("all bad urls return errors", {
  expect_true(
    all(
      purrr::map_lgl(
        bad_url_results,
        ~ !rlang::is_null(
          purrr::pluck(.x, "error")
        )
      )
    )
  )
})
#> Test passed 🌈

test_that("a bad url returns the right error message", {
  expect_error(
    stop_on_status("https://docs.ropensci.org/vcr/not_a_url"),
    regexp = "Can't find file at `https://docs.ropensci.org/vcr/not_a_url`.",
    fixed = TRUE
  )
})
#> Test passed 🌈
# proposed implementation example -----------------------------------------
path <- "https://docs.ropensci.org/frictionless/not_a_valid_page"
is_url <- function(...) {
  frictionless:::is_url(...)
}

if (is_url(path)) {
  stop_on_status(path)
}
#> Error: Can't find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`.

# alternativly you could place is_url within stop_on_status

Created on 2023-06-22 with reprex v2.0.2

Barely worth placing in a helper, but I'd still do it to keep it DRY. It becomes slightly more verbose if you want to include nice human readable http error messages like Not Found (HTTP 404) as you need to include a lookup table in the function. I don't think we want to completely reverse engineer httr::stop_for_status() ?

PietrH avatar Jun 22 '23 08:06 PietrH

Nice!

  1. No, I don't think we want to completely reverse httr::stop_for_status(). If anything, it would be httr::http_error()
  2. Are 300 codes bad?
  3. I think the is_url() function needs to be kept separate, to differentiate paths from urls
  4. Not sure we need a helper function, but we do have one for is_url, so we could have a super simple one (name inspired by url.exists()):
url_exists <- function(url) {
  curl::curl_fetch_memory(url)$status_code < 300
}

Which is then used in the current code as:

  # Check existence of file at path
  if (is_url(path)) {
    assertthat::assert_that(
      url_exists(path),
      msg = glue::glue("Can't find file at `{path}`.")
    )
  } else {
    assertthat::assert_that(
      file.exists(path),
      msg = glue::glue("Can't find file at `{path}`.")
    )
  }
  return(path)
}

peterdesmet avatar Jun 23 '23 12:06 peterdesmet

Are 300 codes bad?

They are a call to action, usually meaning redirects.

httr::http_error() uses 400 as a cutoff:

TRUE if the request fails (status code 400 or above), otherwise FALSE.

so I suggest we do the same.

Also we might want to capture curl errors as well, otherwise you'll get behaviour like this:

path <- "http://not.a.url"

is_url <- function(...){
  frictionless:::is_url(...)
}

url_exists <- function(url) {
  curl::curl_fetch_memory(url)$status_code < 300
}

if (is_url(path)) {
  assertthat::assert_that(
    url_exists(path),
    msg = glue::glue("Can't find file at `{path}`.")
  )
} else {
  assertthat::assert_that(
    file.exists(path),
    msg = glue::glue("Can't find file at `{path}`.")
  )
}
#> Error in curl::curl_fetch_memory(url): Could not resolve host: not.a.url

Created on 2023-06-23 with reprex v2.0.2

PietrH avatar Jun 23 '23 13:06 PietrH

Agreed on using < 400

The httr::http_error() we currently use lets those curl errors bleed through, so I think it is fine if we keep that. They can be informative to the user.

curl::curl_fetch_memory("")$status
# Error in curl::curl_fetch_memory("") : 
#   URL using bad/illegal format or missing URL
httr::http_error("")
# Error in curl::curl_fetch_memory(url, handle = handle) : 
#   URL using bad/illegal format or missing URL

curl::curl_fetch_memory("https://not.a.host")$status
# Error in curl::curl_fetch_memory("https://not.a.host") : 
#   Could not resolve host: not.a.host
httr::http_error("https://not.a.host")
# Error in curl::curl_fetch_memory(url, handle = handle) : 
#   Could not resolve host: not.a.host

curl::curl_fetch_memory("https://example.com")$status
# [1] 200
httr::http_error("https://example.com")
# [1] FALSE

peterdesmet avatar Jun 23 '23 14:06 peterdesmet

Closing issue. check_path() provides meaningful errors and http errors are not too bad to understand.

peterdesmet avatar Aug 21 '24 15:08 peterdesmet