polite More control over handling non-200 responses when scraping

Open francisbarton opened this issue 3 years ago • 0 comments

I feel like this is a rather vague feature request, but hopefully the example below will help to illustrate my point. I think polite is a great project and I'd like to see it used more widely.

With httr you can ask for the response code from a GET request to a URL, and then choose what action to take if, for example, the code is != 200. polite::scrape uses httr, I believe, but handles the response internally, choosing for example to return NULL on a 404. I'm wondering if it could be made less opinionated.
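To illustrate what I mean by choosing the action myself, here is a minimal sketch using plain httr. httr::GET and httr::status_code are real httr functions; classify_status and get_with_status are hypothetical helper names I made up for this example, and the mapping from status codes to actions is just one possible choice, not anything polite does today.

```r
# Hypothetical helper: map an HTTP status code to an action label.
# The categories here are an assumption, chosen only for illustration.
classify_status <- function(status) {
  if (status >= 200 && status < 300) {
    "ok"
  } else if (status == 404) {
    "missing"
  } else if (status >= 500) {
    "retry"
  } else {
    "fail"
  }
}

# Hypothetical helper: fetch a URL with httr and keep the status code
# alongside the response, so the caller decides what happens next.
get_with_status <- function(url) {
  resp <- httr::GET(url)
  status <- httr::status_code(resp)
  list(
    url = url,
    status = status,
    action = classify_status(status),
    response = resp
  )
}
```

With something like this the caller could, say, skip "missing" pages, retry on "retry" and stop on "fail", rather than only ever seeing NULL.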

Here's a scraping script I wrote the other day, using purrr::map_dfr to combine responses into a single tibble. But if one of a list of URLs returns a 404 then the NULL value breaks the whole thing. I can get round this by rewriting the script (ex 2 below), or by using purrr::possibly (ex 3 below) or maybe by just using map with a reduce(bind_rows) ... but it might be good if polite gave the user more freedom internally as to how it should handle missing or invalid URLs rather than necessarily returning NULL.

I hope that makes sense. Here are my examples:

library(dplyr)
library(polite)
library(purrr)
library(rvest)
library(stringr)

url_root <- "https://www.ongelukvandaag.nl/archief/"

# create three URLs to test
urls <- paste0(url_root, 10:12, "-01-2015") # second URL returns 404

session <- polite::bow(
  url = url_root,
  user_agent = "Francis Barton [email protected]",
  delay = 3
)

function 1

scrape_page <- function(url) {
  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html", verbose = TRUE)

  headings <- page_text %>%
    rvest::html_nodes("h2") %>%
    rvest::html_text()

  dates <- page_text %>%
    rvest::html_nodes(".text-muted") %>%
    rvest::html_text() %>%
    stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

  dplyr::tibble(headings = headings, dates = dates)
}

# run function 1: breaks due to NULL return
purrr::map_dfr(urls, scrape_page)
#> Attempt number 2.
#> Attempt number 3.This is the last attempt, if it fails will return NULL
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> Error in UseMethod("xml_find_all"): no applicable method for 'xml_find_all' applied to an object of class "NULL"

function 2 - includes failsafe for 404s/NULL returns

scrape_page_safe <- function(url) {
  failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)

  page_text <- polite::nod(session, url) %>%
    polite::scrape(accept = "html")

  if (is.null(page_text)) {
    failsafe_tbl
  } else {
    headings <- page_text %>%
      rvest::html_nodes("h2") %>%
      rvest::html_text()

    dates <- page_text %>%
      rvest::html_nodes(".text-muted") %>%
      rvest::html_text() %>%
      stringr::str_extract("[0-9]{2}-[0-9]{2}-[0-9]{4}")

    dplyr::tibble(headings = headings, dates = dates)
  }
}

# run function 2: succeeds
purrr::map_dfr(urls, scrape_page_safe)
#> Warning: Client error: (404) Not Found https://www.ongelukvandaag.nl/archief/
#> 11-01-2015
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~

function 3 - uses purrr::possibly with function 1 to handle errors

failsafe_tbl <- dplyr::tibble(headings = NA_character_, dates = NA_character_)
purrr::map_dfr(urls,
  possibly(          # return a failsafe on error
    scrape_page,
    otherwise = failsafe_tbl
  )
)
#> # A tibble: 8 x 2
#>   headings                                                             dates    
#>   <chr>                                                                <chr>    
#> 1 Inbreker Aldi Hilvarenbeek na botsing met boom aangehouden in gesto~ 10-01-20~
#> 2 Kettingbotsing met twaalf voertuigen op A58 bij Oirschot.            10-01-20~
#> 3 <NA>                                                                 <NA>     
#> 4 losgebroken paard doodgereden na aanrijdingen Amstelveen.            12-01-20~
#> 5 Zware ochtendspits door ongelukken.                                  12-01-20~
#> 6 Zwaargewonde bij aanrijding in Huissen.                              12-01-20~
#> 7 Zwaargewonde bij botsing op Broekdijk in Nuenen.                     12-01-20~
#> 8 Twee gewonden bij ongeluk Ochten.                                    12-01-20~

Created on 2020-09-30 by the reprex package (v0.3.0)
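For completeness, here is a rough sketch of the other workaround I mentioned (plain map followed by row-binding). It is not part of the reprex above. fake_scrape is a hypothetical stand-in that returns NULL for the second URL, so the sketch runs without touching the network; a real version would need the scraping function itself to return NULL early when polite::scrape gives back NULL.

```r
library(dplyr)
library(purrr)

# Hypothetical stand-in for a scraper that returns NULL on a 404,
# so this sketch runs without network access.
fake_scrape <- function(url) {
  if (grepl("11-01-2015", url, fixed = TRUE)) {
    return(NULL)
  }
  dplyr::tibble(headings = paste("heading for", url), dates = basename(url))
}

urls <- paste0("https://www.ongelukvandaag.nl/archief/", 10:12, "-01-2015")

# map() keeps NULL as an ordinary list element instead of failing
results <- purrr::map(urls, fake_scrape)

# Drop the NULL entries, then row-bind what is left
combined <- results %>%
  purrr::compact() %>%
  dplyr::bind_rows()
```

Unlike examples 2 and 3, this drops the failed URL entirely rather than leaving an NA row in its place, which may or may not be what you want.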

Oct 01 '20 08:10 francisbarton
