url_absolute fails with spaces in url

Open jonthegeek opened this issue 10 months ago • 0 comments

xml_attr(x, "href") returns URLs un-encoded when that is how they appear in the source, but url_absolute() then fails on those URLs.

url <- "/filename with spaces.pdf" 
xml2::url_absolute(
  url,
  base = "https://example.com/"
)
#> [1] NA
xml2::url_absolute(
  utils::URLencode(url),
  base = "https://example.com/"
)
#> [1] "https://example.com/filename%20with%20spaces.pdf"

Created on 2023-08-23 with reprex v2.0.2

url_absolute() silently returns NA when the URL contains spaces. It should at least warn the user, though it might be preferable to handle the spaces directly (for example, by encoding them).

This is where I found it in the wild:

base_url <- "https://www.copyright.gov/fair-use/fair-index.html"

pdf_urls <-
  rvest::read_html(base_url) |> 
  rvest::html_element("table") |> 
  rvest::html_elements("tr>td:first-of-type>a:first-of-type") |>
  rvest::html_attr("href")

pdf_urls[[10]] |> 
  rvest::url_absolute(base_url)
#> [1] NA

pdf_urls[[10]] |> 
  utils::URLencode() |> 
  rvest::url_absolute(base_url)
#> [1] "https://www.copyright.gov/fair-use/summaries/ONeil%20v.%20Ratajkowski%20No.%2019%20CIV.%209769%20(S.D.N.Y.%202021).pdf"

Created on 2023-08-23 with reprex v2.0.2
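
Until this is handled in xml2 itself, a small wrapper along these lines may serve as a workaround. This is only a sketch: `url_absolute_safe()` is a hypothetical name (it is not part of xml2 or rvest), and it assumes that percent-encoding with utils::URLencode(), as in the examples above, is acceptable for the URLs in question. It encodes each non-missing URL, resolves it against the base, and warns instead of returning NA silently.

```r
# Hypothetical helper, not part of xml2 or rvest.
url_absolute_safe <- function(x, base) {
  out <- rep(NA_character_, length(x))
  ok <- !is.na(x)

  # URLencode() with its defaults leaves reserved characters such as "/"
  # untouched, and returns strings that already contain %xx escapes unchanged.
  # vapply() is used so that each URL is encoded on its own.
  encoded <- vapply(x[ok], utils::URLencode, character(1), USE.NAMES = FALSE)
  out[ok] <- xml2::url_absolute(encoded, base)

  # Warn about inputs that were present but still could not be resolved.
  failed <- ok & is.na(out)
  if (any(failed)) {
    warning(
      sum(failed), " URL(s) could not be resolved against `base`.",
      call. = FALSE
    )
  }
  out
}

url_absolute_safe("/filename with spaces.pdf", base = "https://example.com/")
#> [1] "https://example.com/filename%20with%20spaces.pdf"
```

Missing inputs (for example, an element without an href) are passed through as NA without a warning, so the warning only fires for URLs that were present but could not be resolved.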

jonthegeek, Aug 23 '23 15:08