xml2
url_absolute fails with spaces in url
xml_attr(x, "href") returns un-encoded URLs if that's how they appear in the source, but then those URLs fail in url_absolute().
url <- "/filename with spaces.pdf"
xml2::url_absolute(
  url,
  base = "https://example.com/"
)
#> [1] NA
xml2::url_absolute(
  utils::URLencode(url),
  base = "https://example.com/"
)
#> [1] "https://example.com/filename%20with%20spaces.pdf"
Created on 2023-08-23 with reprex v2.0.2
url_absolute() gets confused if the URL contains spaces, and silently returns NA. This should at least warn the user, but it might be preferable to deal with it directly.
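For illustration, here is a rough sketch of what "warn, or deal with it directly" could look like as a user-side wrapper. The name url_absolute_encoded() is hypothetical and not part of xml2 or rvest; it only combines the two calls from the reprex above, and I have not checked how it behaves for URLs that are already partly percent-encoded (utils::URLencode() leaves those alone by default).

```r
# Hypothetical helper, not part of xml2 or rvest: percent-encode each URL
# before resolving it, and warn if any result is still NA.
url_absolute_encoded <- function(x, base) {
  out <- rep(NA_character_, length(x))
  ok <- !is.na(x)

  if (any(ok)) {
    # Encode one element at a time so this does not rely on URLencode()
    # being vectorised.
    encoded <- vapply(x[ok], utils::URLencode, character(1), USE.NAMES = FALSE)
    out[ok] <- xml2::url_absolute(encoded, base)
  }

  # Surface failures instead of returning NA silently.
  failed <- is.na(out) & ok
  if (any(failed)) {
    warning(sum(failed), " URL(s) could not be resolved against `base`.", call. = FALSE)
  }
  out
}

url_absolute_encoded("/filename with spaces.pdf", base = "https://example.com/")
#> [1] "https://example.com/filename%20with%20spaces.pdf"
```

Doing the encoding inside url_absolute() itself would presumably be cleaner, but a wrapper like this at least makes the NA visible.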
This is where I found it in the wild:
base_url <- "https://www.copyright.gov/fair-use/fair-index.html"
pdf_urls <-
  rvest::read_html(base_url) |>
  rvest::html_element("table") |>
  rvest::html_elements("tr>td:first-of-type>a:first-of-type") |>
  rvest::html_attr("href")
pdf_urls[[10]] |>
  rvest::url_absolute(base_url)
#> [1] NA
pdf_urls[[10]] |>
  utils::URLencode() |>
  rvest::url_absolute(base_url)
#> [1] "https://www.copyright.gov/fair-use/summaries/ONeil%20v.%20Ratajkowski%20No.%2019%20CIV.%209769%20(S.D.N.Y.%202021).pdf"
Created on 2023-08-23 with reprex v2.0.2