
add ILINet revisions as well?

Open randomgambit opened this issue 6 years ago • 4 comments

Hello @hrbrmstr, another great package!

I am interested in the ILINet data, but I think the current package only returns what http://currents.plos.org/outbreaks/index.html%3Fp=39911.html#ref7 calls the gold standard.

That is, the ILINet data that was deemed final, not the data as it was available in real time. Instead, the historical snapshots seem to be available at addresses like the following:

https://www.cdc.gov/flu/weekly/weeklyarchives2013-2014/data/senAllregt08.htm

where the 8 in senAllregt08 means this is the ILI data that was current during week 8 of the 2013-2014 season. If you pull the data for the next week,

https://www.cdc.gov/flu/weekly/weeklyarchives2013-2014/data/senAllregt09.htm

you will notice that PAST ILI values may have been revised/modified.

It is very important to keep track of revisions over time, because one really wants to use the information that was actually available at the time of the forecast. In other words, during week T you only know the data shown in senAllregtT.htm.
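
For example, something like this sketch could pull two consecutive snapshots for comparison (the ili_archive_url() helper is made up here, and it assumes both that the senAllregtNN.htm naming pattern holds across seasons and that the pages expose their chart data as an HTML table):

library(rvest)

# Hypothetical helper: build the archived ILINet URL for a given season and
# surveillance week, assuming the senAllregtNN.htm pattern holds
ili_archive_url <- function(season = "2013-2014", week = 8) {
  sprintf(
    "https://www.cdc.gov/flu/weekly/weeklyarchives%s/data/senAllregt%02d.htm",
    season, week
  )
}

# assuming each page exposes its chart data as an HTML <table>, pull two
# consecutive weekly snapshots so past values can be compared for revisions
wk08 <- read_html(ili_archive_url(week = 8)) %>% html_table(fill = TRUE)
wk09 <- read_html(ili_archive_url(week = 9)) %>% html_table(fill = TRUE)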

Is this something you can incorporate into the package? Perhaps we should store these as a list column. Let me know what you think.
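
To be concrete about the list-column idea, here is the shape I have in mind (the stand-in data frames are made up, just to show one row per report week with the whole snapshot nested):

library(tibble)

# hypothetical shape: one row per (season, report week), with the entire
# ILINet snapshot as published that week stored whole in a list column
wk08_df <- data.frame(week = 1:8, ili = runif(8))  # stand-in for the week-08 snapshot
wk09_df <- data.frame(week = 1:9, ili = runif(9))  # stand-in for the week-09 snapshot

revisions <- tibble(
  season      = "2013-2014",
  report_week = c(8L, 9L),
  ilinet      = list(wk08_df, wk09_df)
)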

Thanks!

randomgambit avatar May 28 '19 14:05 randomgambit

Neat! Moar data!

I found https://www.cdc.gov/flu/weekly/pastreports.htm which looks like a decent index (perhaps not thorough but it's workable).

Sadly

library(rvest)
library(furrr)
library(stringi)
library(tidyverse)

plan(multiprocess)  # parallel workers for future_map()

pg <- read_html("https://www.cdc.gov/flu/weekly/pastreports.htm")

# pull every "Week ..." entry from the report-picker <option>s and turn
# relative paths into absolute URLs
html_nodes(pg, xpath = ".//option[contains(., 'Week')]") %>% 
  html_attr("value") %>% 
  ifelse(grepl("^/", .), sprintf("http://www.cdc.gov%s", .), .) -> ili_urls

# failure-tolerant wrappers: return an empty list() instead of erroring
HEED <- possibly(httr::HEAD, list())
GEET <- possibly(httr::GET, list())

iu_df <- tibble(ili_url = ili_urls)
iu_df <- mutate(iu_df, pg = future_map(ili_url, GEET, httr::timeout(5)))

saveRDS(iu_df, "~/Data/cdc-weekly-crawl.rds")  # cache the crawl

REED <- possibly(xml2::read_html, list())

# keep the successful HTML responses, parse them, and harvest the
# "View Chart Data" links from each weekly report page
filter(iu_df, lengths(pg) > 0) %>% 
  mutate(ctype = map_chr(pg, ~.x$headers[["content-type"]])) %>% 
  filter(grepl("html", ctype)) %>% 
  mutate(content = map_chr(pg, ~httr::content(.x, as = "text", encoding = "UTF-8"))) %>%
  mutate(doc = map(content, REED)) %>%
  filter(lengths(doc) > 0) %>%
  mutate(links = map(
    doc, ~.x %>% 
      html_nodes(xpath = ".//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'view chart data')]") %>% 
      html_attr("href")
  )) -> iu_df

# reduce each link to just its season slug (e.g. "2013-2014")
iu_df$links %>% 
  unlist() %>% 
  stri_replace_all_regex("^.*archives|/.*$", "") %>% 
  unique()

didn't really give me back what I was hoping for, so it'll be a little bit before this is done, but it's def a gd idea and def doable.

hrbrmstr avatar May 28 '19 19:05 hrbrmstr

@hrbrmstr thanks! I actually think I have found another way. Let me run some checks and I will get back to you. Out of curiosity, did you use SelectorGadget to get the correct XPaths?

randomgambit avatar May 29 '19 00:05 randomgambit

good news! there is actually an API just for that. https://github.com/cmu-delphi/delphi-epidata/issues/18
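
A quick sketch of hitting it with httr (endpoint and parameter names are taken from the Delphi docs; the issues parameter is what pins the response to the data as it was published in a given week, so worth double-checking against their current documentation):

library(httr)
library(jsonlite)

# sketch against the Delphi Epidata fluview endpoint; `issues` requests the
# data exactly as it looked when that epiweek's report was published
res <- GET(
  "https://api.delphi.cmu.edu/epidata/fluview/",
  query = list(
    regions  = "nat",
    epiweeks = "201340-201352",
    issues   = "201345"
  )
)

dat <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
str(dat$epidata, max.level = 1)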

randomgambit avatar May 29 '19 00:05 randomgambit

quick wrapper for that, er, "interesting" way to expose R functions (very JavaScript-esque, tho). https://github.com/hrbrmstr/delphiepidata

i'll add some "tidiers" once I grok the return values a bit more
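
in the meantime, roughly what I'm picturing for a tidier (a sketch only; field names like epiweek, issue, and wili follow the Delphi fluview docs, and this is not the package's actual interface yet):

library(tibble)
library(purrr)

# hypothetical tidier: flatten the list-of-records the API returns into one
# row per (epiweek, issue); field names per the Delphi fluview docs
tidy_fluview <- function(epidata) {
  map_dfr(epidata, ~tibble(
    epiweek = .x$epiweek,
    issue   = .x$issue,
    region  = .x$region,
    wili    = .x$wili
  ))
}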

hrbrmstr avatar May 29 '19 12:05 hrbrmstr