cdcfluview
add ILInet revisions as well?
Hello @hrbrmstr, another great package!
I am interested in the ILInet data, but I think the current package only returns what http://currents.plos.org/outbreaks/index.html%3Fp=39911.html#ref7 calls the "gold standard".
That is, the ILInet data that was deemed final, which was not available in real time. Instead, the historical (as-published) snapshots seem to be available at addresses like the following:
https://www.cdc.gov/flu/weekly/weeklyarchives2013-2014/data/senAllregt08.htm
where the 08 in senAllregt08 means this is the ILI data that was current during week 8 of the 2013-2014 season. If you pull the data for the next week,
https://www.cdc.gov/flu/weekly/weeklyarchives2013-2014/data/senAllregt09.htm
you will notice that PAST ILI values may have been revised/modified.
It is very important to keep track of revisions over time because one really wants to use only the information that was actually available at the time of the forecast. In other words, during week T you only know the data shown in senAllregtT.htm.
Is this something you can incorporate in the package? Perhaps we should store these as a list column. Let me know what you think.
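For concreteness, here is roughly the shape I have in mind. The URL pattern is the one above; everything else (in particular, that the first HTML table on each page holds the ILI values) is just a guess on my part:

library(rvest)
library(tidyverse)

season <- "2013-2014"
weeks  <- 8:9   # the two example weeks above

# one row per surveillance week, with that week's full ILINet snapshot
# nested in a list column
snapshots <- tibble(week = weeks) %>%
  mutate(
    url = sprintf(
      "https://www.cdc.gov/flu/weekly/weeklyarchives%s/data/senAllregt%02d.htm",
      season, week
    ),
    ili = map(url, ~html_table(read_html(.x))[[1]])  # guess: first table has the data
  )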
Thanks!
Neat! Moar data!
I found https://www.cdc.gov/flu/weekly/pastreports.htm, which looks like a decent index (perhaps not exhaustive, but workable).
Sadly
library(rvest)
library(furrr)
library(stringi)
library(tidyverse)

plan(multiprocess)  # parallelize the crawl

# grab the index of past weekly reports
pg <- read_html("https://www.cdc.gov/flu/weekly/pastreports.htm")

# pull every <option> whose label mentions "Week" and absolutize relative URLs
html_nodes(pg, xpath = ".//option[contains(., 'Week')]") %>%
  html_attr("value") %>%
  ifelse(grepl("^/", .), sprintf("http://www.cdc.gov%s", .), .) -> ili_urls

# "safe" wrappers so one bad URL doesn't kill the whole crawl
HEED <- possibly(httr::HEAD, list())  # (ended up unused)
GEET <- possibly(httr::GET, list())

iu_df <- tibble(ili_url = ili_urls)
iu_df <- mutate(iu_df, pg = future_map(ili_url, GEET, httr::timeout(5)))

saveRDS(iu_df, "~/Data/cdc-weekly-crawl.rds")  # cache the raw crawl

REED <- possibly(xml2::read_html, list())

# keep successful HTML responses, parse them, then harvest the
# "View Chart Data" links from each weekly report page
filter(iu_df, lengths(pg) > 0) %>%
  mutate(ctype = map_chr(pg, ~.x$headers[["content-type"]])) %>%
  filter(grepl("html", ctype)) %>%
  mutate(content = map_chr(pg, ~httr::content(.x, as = "text", encoding = "UTF-8"))) %>%
  mutate(doc = map(content, REED)) %>%
  filter(lengths(doc) > 0) %>%
  mutate(links = map(
    doc, ~.x %>%
      html_nodes(xpath = ".//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'view chart data')]") %>%
      html_attr("href")
  )) -> iu_df

# reduce the links to the season identifiers (the bit between "archives" and the next "/")
iu_df$links %>%
  unlist() %>%
  stri_replace_all_regex("^.*archives|/.*$", "") %>%
  unique()
didn't really give me back what I was hoping for, so it'll be a little bit before this is done, but it's def a gd idea and def doable.
@hrbrmstr thanks! I actually think I have found another way. Let me run some checks and I will get back to you. Out of curiosity, to get the correct XPaths, did you use SelectorGadget?
good news! there is actually an API just for that. https://github.com/cmu-delphi/delphi-epidata/issues/18
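raw calls against it look something like this (endpoint and parameter names are from the current Epidata docs, so treat this as a sketch, not gospel):

library(httr)
library(jsonlite)

# ask the fluview endpoint for national ILINet data for epiweek 2014-09
# *as it was published* that same week (issues = 201409), i.e. pre-revision
res <- GET(
  "https://api.delphi.cmu.edu/epidata/fluview/",
  query = list(regions = "nat", epiweeks = "201409", issues = "201409")
)
payload <- fromJSON(content(res, as = "text", encoding = "UTF-8"))
# payload$epidata holds the rows; payload$result / payload$message carry status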
quick wrapper for that, er, "interesting" way to expose R functions (very javascript-esque, tho). https://github.com/hrbrmstr/delphiepidata
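(for the curious: the upstream repo ships a client you source() that stuffs every endpoint into one big Epidata list of closures, hence "javascript-esque". Going by their README it's called like this, tho treat the exact signatures as approximate:)

# upstream-client calling style (sketch)
source("delphi_epidata.R")  # the client file from the delphi-epidata repo

res <- Epidata$fluview(
  regions  = list("nat"),
  epiweeks = list(201409),
  issues   = list(201409)  # the week-09 snapshot rather than the final values
)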
i'll add some "tidiers" once I grok the return values a bit more
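(first guess at what a tidier might look like, assuming the epidata element comes back as a list of flat records:)

# hypothetical first-cut tidier: flatten the $epidata records into a tibble
tidy_epidata <- function(res) {
  dplyr::bind_rows(res$epidata)
}

tidy_epidata(res)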