polite
More control over handling non-200 responses when scraping
I feel like this is a rather vague feature request, but hopefully the example below will help to illustrate my point. I think polite is a great project and I'd like to see it used more widely.
With httr you can ask for the response code from a GET request to a URL, and then choose what action to take if, for example, the code is not 200. polite::scrape uses httr, I believe, but handles the response internally, choosing for example to return NULL from a 404. I'm wondering if it could be made less opinionated.
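To make the httr comparison concrete, here is a minimal sketch of the kind of control I mean. The names action_for_status and get_with_status are my own, made up for illustration, and the mapping from status codes to actions is only an assumption about what a user might want; it is not how polite or httr behave internally.

```r
# Hypothetical helper: map an HTTP status code to an action label.
# The choice of actions here is illustrative only.
action_for_status <- function(code) {
  if (code == 200L) {
    "parse"
  } else if (code == 404L) {
    "skip"
  } else if (code >= 500L) {
    "retry"
  } else {
    "stop"
  }
}

# Hypothetical wrapper around httr: the caller, not the library,
# decides what a non-200 response means.
get_with_status <- function(url) {
  resp <- httr::GET(url)
  code <- httr::status_code(resp)
  action <- action_for_status(code)
  body <- if (action == "parse") {
    httr::content(resp, as = "text", encoding = "UTF-8")
  } else {
    NA_character_
  }
  list(url = url, status = code, action = action, body = body)
}
```

Calling get_with_status() on a URL needs httr installed and a network connection; the returned list keeps the status code next to the body, so the decision about a 404 stays with the calling code.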
Here's a scraping script I wrote the other day, using purrr::map_dfr to combine responses into a single tibble. But if one of a list of URLs returns a 404 then the NULL value breaks the whole thing. I can get round this by rewriting the script (ex 2 below), or by using purrr::possibly (ex 3 below), or maybe by just using map with a reduce(bind_rows)... but it might be good if polite gave the user more freedom internally as to how it should handle missing or invalid URLs rather than necessarily returning NULL.
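The map plus reduce(bind_rows) idea is not shown in my examples, so here is a rough offline sketch of it. fake_scrape and demo_urls are stand-ins I made up so that nothing touches the network; the sketch assumes the scraping function itself returns NULL for a missing page rather than erroring on it.

```r
library(dplyr)
library(purrr)

# Stand-in for a scraper that returns NULL for a missing page (no network used).
fake_scrape <- function(url) {
  if (grepl("11-01-2015", url, fixed = TRUE)) {
    return(NULL)
  }
  dplyr::tibble(headings = paste("heading from", url), dates = NA_character_)
}

# Made-up URLs; the second one plays the role of the 404.
demo_urls <- paste0("https://example.com/", 10:12, "-01-2015")

# A list holding two tibbles and one NULL
pages <- purrr::map(demo_urls, fake_scrape)

combined <- pages %>%
  purrr::compact() %>%             # drop the NULL entries
  purrr::reduce(dplyr::bind_rows)  # stack whatever is left
```

Unlike the failsafe approach, this drops the missing page entirely instead of leaving an NA row in its place. Note that purrr::reduce() errors on an empty list unless .init is supplied, so it would need a guard if every URL could fail.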
I hope that makes sense. Here are my examples:
library(dplyr)
library(polite)
library(purrr)
library(rvest)
library(stringr)
url_root <- "https://www.ongelukvandaag.nl/archief/"
# create three URLs to test
urls <- paste0(url_root, 10:12, "-01-2015") # second URL returns 404
session <- polite::bow(
  url = url_root,
  user_agent = "Francis Barton [email protected]",
  delay = 3
)
function 1
scrape_page <- function(url) {
  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html", verbose = TRUE)
  headings <- page_text %>%
    rvest::html_nodes("h2") %>%
    rvest::html_text()
  dates <- page_text %>%
    rvest::html_nodes(".text-muted") %>%
    rvest::html_text() %>%
    stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")
  dplyr::tibble(headings = headings, dates = dates)
}
# run function 1: breaks due to NULL return
purrr::map_dfr(urls, scrape_page)
#> Attempt number 2.
#> Attempt number 3.This is the last attempt, if it fails will return NULL
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> Error in UseMethod("xml_find_all"): no applicable method for 'xml_find_all' applied to an object of class "NULL"
function 2 - includes failsafe for 404s/NULL returns
scrape_page_safe <- function(url) {
  failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)
  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html")
  if (is.null(page_text)) {
    failsafe_tbl
  } else {
    headings <- page_text %>%
      rvest::html_nodes("h2") %>%
      rvest::html_text()
    dates <- page_text %>%
      rvest::html_nodes(".text-muted") %>%
      rvest::html_text() %>%
      stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")
    dplyr::tibble(headings = headings, dates = dates)
  }
}
# run function 2: succeeds
purrr::map_dfr(urls, scrape_page_safe)
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> # A tibble: 8 x 2
#> headings dates
#> <chr> <chr>
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot. 10-01-20~
#> 3 <NA> <NA>
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen. 12-01-20~
#> 5 Zware ochtendspits door ongelukken. 12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen. 12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen. 12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten. 12-01-20~
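Function 2 hard-codes what happens on a NULL. As a rough sketch of the "more freedom" I'm asking about, the fallback could be something the caller passes in. Everything below is hypothetical: with_fallback, toy_scraper, na_row and fail_loudly are names I invented, and toy_scraper only stands in for the polite::nod() plus polite::scrape() step so that the example runs offline.

```r
# Hypothetical adapter: wrap any scraper that may return NULL and let the
# caller decide what to return (or whether to stop) in that case.
with_fallback <- function(scraper, on_null) {
  function(url) {
    result <- scraper(url)
    if (is.null(result)) on_null(url) else result
  }
}

# Toy scraper standing in for the real request; returns NULL for "missing" pages.
toy_scraper <- function(url) {
  if (endsWith(url, "missing")) NULL else data.frame(url = url, ok = TRUE)
}

# One caller wants a placeholder row ...
na_row <- function(url) data.frame(url = url, ok = NA)
# ... another wants the whole run to halt.
fail_loudly <- function(url) stop("No content for ", url, call. = FALSE)

lenient <- with_fallback(toy_scraper, na_row)
strict <- with_fallback(toy_scraper, fail_loudly)

# The lenient version keeps going and records the gap as an NA row.
rows <- do.call(rbind, lapply(c("a/ok", "a/missing"), lenient))
```

With this shape, the same scraping function can feed a pipeline that tolerates gaps and one that treats any missing page as an error, without editing the scraper itself.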
function 3 - uses purrr::possibly with function 1 to handle errors
failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)
purrr::map_dfr(urls,
  possibly( # return a failsafe on error
    scrape_page,
    otherwise = failsafe_tbl
  )
)
#> # A tibble: 8 x 2
#> headings dates
#> <chr> <chr>
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot. 10-01-20~
#> 3 <NA> <NA>
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen. 12-01-20~
#> 5 Zware ochtendspits door ongelukken. 12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen. 12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen. 12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten. 12-01-20~
Created on 2020-09-30 by the reprex package (v0.3.0)