frictionless-r
frictionless-r copied to clipboard
Replace httr with httr2
The httr package is a dependency of frictionless: https://github.com/frictionlessdata/frictionless-r/blob/b0a9fe34b26fe120309899503bba67cffcb710ff/DESCRIPTION#L33.
However, httr is superseded. As this should not create a big issue, for long term development it could be better to switch to httr2. From doc:
httr2 is a ground-up rewrite of httr that provides a pipeable API with an explicit request object that solves more problems felt by packages that wrap APIs (e.g. built-in rate-limiting, retries, OAuth, secure secrets, and more).
We only use httr::http_error()
to check if something can be found at a url (path):
https://github.com/frictionlessdata/frictionless-r/blob/772862a595a1861eba91a64d1db7952ac63c7e7e/R/utils.R#L107-L110
I don't find a straightforward replacement for that in httr2
. We could also try with the underlying curl
package ...
In RCurl
there's a function that has the same behaviour, but RCurl
is not advised by rOpenSci:
library(RCurl)
url.exists("https://example.com/404")
#> [1] FALSE
Created on 2023-06-20 with reprex v2.0.2
I think httr2 is still an option because it handles errors way more nicely van httr. See example in https://httr2.r-lib.org/articles/wrapping-apis.html#errors which I think could be useful in our case.
There are a number of options, a small demonstration:
# setup
library(tidyverse)
bad_url <- "https://docs.ropensci.org/frictionless/not_a_valid_page"
# 1. standard behaviour of httr2 is to return an error on http errors
httr2::request(bad_url) %>% httr2::req_perform()
#> Error in `httr2::req_perform()`:
#> ! HTTP 404 Not Found.
# 2. But you can overwrite this behaviour and pass your own additional error
# information, perhaps returned from the response itself
httr2::request(bad_url) %>%
httr2::req_error(
body = ~ glue::glue("Can't find file at `{path}`.",
path = bad_url
)
) %>%
httr2::req_perform()
#> Error in `httr2::req_perform()`:
#> ! HTTP 404 Not Found.
#> • Can't find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`.
# 3. You don't need assertthat to do this in httr either:
httr::stop_for_status(
httr::HEAD(bad_url),
task = glue::glue("find file at `{path}`.",
path = bad_url
)
)
#> Error: Not Found (HTTP 404). Failed to find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`..
# 4. or we can suppress this behaviour completely
assertthat::assert_that(
!httr2::resp_is_error(
httr2::request(bad_url) %>%
httr2::req_error(
is_error = function(response) FALSE
) %>%
httr2::req_perform()
),
msg = glue::glue("Can't find file at `{path}`.",
path = bad_url
)
)
#> Error: Can't find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`.
## 5. or without assertthat
httr2::request(bad_url) %>%
httr2::req_error(is_error = function(response) FALSE) %>%
httr2::req_perform() %>%
httr2::resp_check_status(
info =
glue::glue("Can't find file at `{path}`.",
path = bad_url
)
)
#> Error:
#> ! HTTP 404 Not Found.
#> • Can't find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`.
Do you think it's beneficial to return the http error itself?
The implementation would be:
if (is_url(path)) {
response <- httr2::req_perform(
httr2::req_error(httr2::request(path), is_error = function(resp) FALSE)
) # req_error() is used to silence R error
assertthat::assert_that(
!httr2::resp_is_error(response),
msg = glue::glue("Can't find file at `{path}`.")
)
So that is four httr2
functions rather than one httr
function. And httr2
has more dependencies. I would therefore not move to httr2
.
If someone can replicate this behaviour in underlying curl
that we could use that approach.
It's pretty doable as long as you assume all HTTP status codes under 300 are good, and all above are bad. It's up to us how complex we want to make this.
There you go, I've also snuk in an example of using purrr adverbs in the tests:
# stopping on http error status using curl
library(testthat)
# setting urls to test ----------------------------------------------------
urls_to_test <- list(
bad_frictionless = "https://docs.ropensci.org/frictionless/not_a_valid_page",
good_curl = "https://jeroen.r-universe.dev/curl",
bad_httpbin_404 = "http://httpbin.org/status/404",
good_httpbin_201 = "http://httpbin.org/status/201",
bad_httpbin_504 = "http://httpbin.org/status/504",
bad_httpbin_429 = "http://httpbin.org/status/429",
good_httpbin_200 = "http://httpbin.org/status/200"
)
# define function to stop on status ---------------------------------------
stop_on_status <- function(url) {
assertthat::assert_that(
curl::curl_fetch_memory(url)$status_code < 300,
msg = glue::glue("Can't find file at `{path}`.",
path = url
)
)
}
# test function, continue on error ----------------------------------------
test_result <-
purrr::map(urls_to_test, purrr::safely(stop_on_status)) %>%
purrr::set_names(names(urls_to_test))
good_url_results <-
test_result[stringr::str_starts(names(test_result), "good")]
bad_url_results <-
test_result[stringr::str_starts(names(test_result), "bad")]
test_that("all good urls return TRUE", {
expect_true(
all(
purrr::map_lgl(
good_url_results,
~ rlang::is_true(
purrr::pluck(.x, "result")
)
)
)
)
})
#> Test passed 🥳
test_that("all bad urls return errors", {
expect_true(
all(
purrr::map_lgl(
bad_url_results,
~ !rlang::is_null(
purrr::pluck(.x, "error")
)
)
)
)
})
#> Test passed 🌈
test_that("a bad url returns the right error message", {
expect_error(
stop_on_status("https://docs.ropensci.org/vcr/not_a_url"),
regexp = "Can't find file at `https://docs.ropensci.org/vcr/not_a_url`.",
fixed = TRUE
)
})
#> Test passed 🌈
# proposed implementation example -----------------------------------------
path <- "https://docs.ropensci.org/frictionless/not_a_valid_page"
is_url <- function(...) {
frictionless:::is_url(...)
}
if (is_url(path)) {
stop_on_status(path)
}
#> Error: Can't find file at `https://docs.ropensci.org/frictionless/not_a_valid_page`.
# alternativly you could place is_url within stop_on_status
Created on 2023-06-22 with reprex v2.0.2
Barely worth placing in a helper, but I'd still do it to keep it DRY. It becomes slightly more verbose if you want to include nice human readable http error messages like Not Found (HTTP 404)
as you need to include a lookup table in the function. I don't think we want to completely reverse engineer httr::stop_for_status()
?
Nice!
- No, I don't think we want to completely reverse
httr::stop_for_status()
. If anything, it would behttr::http_error()
- Are 300 codes bad?
- I think the
is_url()
function needs to be kept separate, to differentiate paths from urls - Not sure we need a helper function, but we do have one for
is_url
, so we could have a super simple one (name inspired byurl.exists()
):
url_exists <- function(url) {
curl::curl_fetch_memory(url)$status_code < 300
}
Which is then used in the current code as:
# Check existence of file at path
if (is_url(path)) {
assertthat::assert_that(
url_exists(path),
msg = glue::glue("Can't find file at `{path}`.")
)
} else {
assertthat::assert_that(
file.exists(path),
msg = glue::glue("Can't find file at `{path}`.")
)
}
return(path)
}
Are 300 codes bad?
They are a call to action, usually meaning redirects.
httr::http_error()
uses 400 as a cutoff:
TRUE
if the request fails (status code 400 or above), otherwiseFALSE
.
so I suggest we do the same.
Also we might want to capture curl errors as well, otherwise you'll get behaviour like this:
path <- "http://not.a.url"
is_url <- function(...){
frictionless:::is_url(...)
}
url_exists <- function(url) {
curl::curl_fetch_memory(url)$status_code < 300
}
if (is_url(path)) {
assertthat::assert_that(
url_exists(path),
msg = glue::glue("Can't find file at `{path}`.")
)
} else {
assertthat::assert_that(
file.exists(path),
msg = glue::glue("Can't find file at `{path}`.")
)
}
#> Error in curl::curl_fetch_memory(url): Could not resolve host: not.a.url
Created on 2023-06-23 with reprex v2.0.2
Agreed on using < 400
The httr::http_error()
we currently use lets those curl errors bleed through, so I think it is fine if we keep that. They can be informative to the user.
curl::curl_fetch_memory("")$status
# Error in curl::curl_fetch_memory("") :
# URL using bad/illegal format or missing URL
httr::http_error("")
# Error in curl::curl_fetch_memory(url, handle = handle) :
# URL using bad/illegal format or missing URL
curl::curl_fetch_memory("https://not.a.host")$status
# Error in curl::curl_fetch_memory("https://not.a.host") :
# Could not resolve host: not.a.host
httr::http_error("https://not.a.host")
# Error in curl::curl_fetch_memory(url, handle = handle) :
# Could not resolve host: not.a.host
curl::curl_fetch_memory("https://example.com")$status
# [1] 200
httr::http_error("https://example.com")
# [1] FALSE
Closing issue. check_path()
provides meaningful errors and http errors are not too bad to understand.