covid19hub
Functions to import reported daily sub-national new cases and deaths from LMIC
Description
A function for each country that accesses public data sources and makes the data accessible in R. Good examples for a number of countries can be found in this package: https://github.com/epiforecasts/NCoVUtils
Functions exist for most European and Asian countries and the US. We are currently looking for data and functions for LMIC, especially African countries.
I suggest we use this issue to keep track of:
1. Needs for data and R functions for specific countries
2. Available data sources for specific countries
3. Functions that import the data sources identified in 2.
See below for the Google Sheet tracking these.
Output
The output format of each function should be a long data frame containing the following columns:
- country
- admin_subdivision_level_1
- admin_subdivision_level_2 (if available)
- (more levels if available)
- date
- cases
- deaths
where cases and deaths stand for the newly reported cases and deaths on that day.
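To make the target format concrete, here is a minimal sketch of a check that a country function's output has the required columns and types. The helper name `check_output_format()` is hypothetical (it is not part of NCoVUtils), and the numbers in the example are invented for illustration.

```r
# Hypothetical helper: check that a data frame follows the long format above.
check_output_format <- function(df) {
  required <- c("country", "admin_subdivision_level_1", "date", "cases", "deaths")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Dates must be Date objects; counts must be numeric (new reports per day).
  stopifnot(
    inherits(df$date, "Date"),
    is.numeric(df$cases),
    is.numeric(df$deaths)
  )
  invisible(df)
}

# Invented numbers, for illustration only.
example <- data.frame(
  country = "Burkina Faso",
  admin_subdivision_level_1 = c("Centre", "Centre", "Hauts-Bassins"),
  date = as.Date(c("2020-04-01", "2020-04-02", "2020-04-01")),
  cases = c(5, 3, 1),
  deaths = c(0, 1, 0),
  stringsAsFactors = FALSE
)
check_output_format(example)
```

Deeper subdivision levels (admin_subdivision_level_2 and so on) are optional, so the sketch does not require them.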
Functions can either be added to https://github.com/epiforecasts/NCoVUtils via pull request, or we can start our own package that wraps NCoVUtils and other solutions for the already implemented countries.
Countries already covered:
https://github.com/epiforecasts/NCoVUtils covers the following countries so far:
- Belgium
- Canada
- France
- Germany
- Italy
- Spain
- United Kingdom
- United States
- Japan
- Korea
- Afghanistan
Countries to be done
Spreadsheet to track requested countries, data sources and implementations: https://docs.google.com/spreadsheets/d/1uvg07BAmwKqLqhKvkejhkX7uvXiGCre4sz11Au3pz9Q/edit?usp=sharing
My own priority list:
- [ ] Burkina Faso
- [ ] Iraq
- [ ] Democratic Republic of the Congo
- [ ] Syria
Links
A few places where data sources are indexed:
https://data.humdata.org/event/covid-19
https://coronavirustechhandbook.com/home
https://www.europeandataportal.eu/data/datasets?locale=en&categories=heal&page=1&query=covid
@patrickbarks @paulc91 @ntncmch @scottyaz @sbfnk @epiforcasts @seabbs @hamishgibbs @jhellewell14
Would be good to start a google spreadsheet (if it doesn't exist) with sources for each country. For example http://covid19.health.gov.mw is a good source for Malawi.
We are keen to have contributions to our package but are also happy for this to be a separate project if that makes sense. We've been discussing whether and how to support it as a more widely known data resource, and doing so makes more and more sense.
Google spreadsheet to track requests for countries, data sources and implementations: https://docs.google.com/spreadsheets/d/1uvg07BAmwKqLqhKvkejhkX7uvXiGCre4sz11Au3pz9Q/edit?usp=sharing
@seabbs, happy to contribute to NCoVUtils
Would like to work on Burkina Faso
Hey,
Can you explain the process for contributing?
Do you want us to open a PR against {NCovUtils}?
Also, there is: https://www.worldometers.info/coronavirus/
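Here's a function to get today's and yesterday's data frames:
get_worldmeter_df <- function() {
  # Parse the Worldometers landing page
  page <- xml2::read_html("https://www.worldometers.info/coronavirus/")
  # html_table() returns one data frame per <table> on the page
  tbls <- rvest::html_table(page)
  # Drop the first 7 rows of each table (hard-coded offset; adjust if the
  # page layout changes)
  tbls[[1]] <- tbls[[1]][8:nrow(tbls[[1]]), ]
  tbls[[2]] <- tbls[[2]][8:nrow(tbls[[2]]), ]
  list(
    today = tbls[[1]],
    yesterday = tbls[[2]]
  )
}
get_worldmeter_df()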
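It has national figures for Burkina Faso, Congo and Syria. "Irak" matches no row in the output below, presumably because the table uses the English spelling "Iraq":
tod <- get_worldmeter_df()$today
tod[
  tod$`Country,Other` %in% c("Burkina Faso", "Irak", "Congo", "Syria"),
]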
Country,Other TotalCases NewCases TotalDeaths NewDeaths
99 Burkina Faso 484 27
147 Congo 60 5
166 Syria 25 2
TotalRecovered ActiveCases Serious,Critical
99 155 302
147 5 50
166 5 18
Tot Cases/1M pop Deaths/1M pop TotalTests Tests/1M pop
99 23 1
147 11 0.9
166 1 0.1
Continent
99 Africa
147 Africa
166 Asia
Hi @ColinFay,
Yes, the best approach is to open a PR against NCovUtils.
I haven't seen any sub-national data (by region, province or similar) on Worldometers. Am I missing something?
I think it would still be a good additional resource alongside the existing functions that get national data from ECDC, WHO, JHU or similar, especially since it seems to include data on testing.
@ffinger not that I know of
Possible other source for Burkina Faso: https://www.humanitarianresponse.info/en/op%C3%A9rations/burkina-faso/documents/table/themes/covid-19
The PDF(s) would need to be scraped.
Thanks for this! Anyone having time to implement scraping?
There is a figure here too, sources are probably the previously linked sitreps: https://fr.wikipedia.org/wiki/Pand%C3%A9mie_de_Covid-19_au_Burkina_Faso
It is probably possible to scrape, since the data seems to be in the code of the figure (click on "modify code" to see).
Here's the code to download all the pdfs:
dir.create("burkina_covid")
base_url <- "https://www.humanitarianresponse.info/en/op%C3%A9rations/burkina-faso/documents/table/themes/covid-19"
# The listing spans four pages: the base URL plus ?page=1 to ?page=3
pages <- c(base_url, paste0(base_url, "?page=", 1:3))
for (page_url in pages) {
  page <- xml2::read_html(page_url)
  # Download links sit in the dropdown menu of each listed document
  links <- rvest::html_nodes(page, ".dropdown-menu a")
  for (href in rvest::html_attr(links, "href")) {
    # mode = "wb" keeps binary files (PDF, DOCX) intact on Windows
    download.file(
      href,
      file.path("burkina_covid", basename(href)),
      mode = "wb"
    )
  }
}
> fs::dir_tree("burkina_covid/")
burkina_covid/
├── covidresponseplanremarks-french.docx
├── ghrp-covid19-en.pdf
├── ghrp-covid19-fr.pdf
├── integration_du_covid-19_dans_la_reponse_humanitaire.pdf
├── plan_de_riposte_covid19-revise_def.pdf
├── reach_bfa_suivi_situation_humanitaire_resultats_pertinents_covid19_region_centre_nord_fevrier_2020-1.pdf
├── reach_bfa_suivi_situation_humanitaire_resultats_pertinents_covid19_region_centre_nord_fevrier_2020-1_1.pdf
├── reach_bfa_suivi_situation_humanitaire_resultats_pertinents_covid19_region_sahel_fevrier_2020.pdf
├── sitrep_n27_du_24_03_20.pdf
├── sitrep_n_29_0.pdf
├── sitrep_n_32_covid-19_du_29_mars_2020_0.pdf
├── sitrep_n_33.pdf
├── sitrep_ndeg17_du_14_03_20.pdf
├── sitrep_ndeg21.pdf
├── sitrep_ndeg24_du_21_03_20.pdf
├── sitrep_ndeg25.pdf
├── sitrep_ndeg28.pdf
├── sitrep_ndeg35_0.pdf
├── sitrep_ndeg_20_du_17_03_20.pdf
├── sitrep_ndeg_22_du_19_03_20.pdf
├── sitrep_ndeg_26_du_23_03_20.pdf
├── sitrep_ndeg_31.pdf
├── sitrep_ndeg_34.pdf
├── sitrep_ndeg_36.pdf
├── sitrep_ndeg_37.pdf
├── sitrep_ndeg_38_covid_bfa_au_04_04_2020.pdf
├── sitrep_ndeg_39_1.pdf
├── sitrep_ndeg_40_0.pdf
├── sitrep_ndeg_41_au_7_avril_2020_1.pdf
├── sitrep_ndeg_42_covid-19_burkina_faso.pdf
├── sitrep_ndeg_43.pdf
└── sitrep_ndeg_44.pdf
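Since the file names are not consistent (`sitrep_n27`, `sitrep_n_32`, `sitrep_ndeg17`, `sitrep_ndeg_44`), here is a rough sketch for pulling the sitrep number out of each name so the reports can be ordered. The helper `sitrep_number()` is hypothetical, and the pattern only covers the naming variants visible in the listing above.

```r
# Hypothetical helper: pull the sitrep number out of a file name.
# Returns NA for files that are not sitreps.
sitrep_number <- function(files) {
  pattern <- "^sitrep_n(deg)?_?([0-9]+).*$"
  is_sitrep <- grepl(pattern, files)
  out <- rep(NA_integer_, length(files))
  out[is_sitrep] <- as.integer(sub(pattern, "\\2", files[is_sitrep]))
  out
}

# A few names taken from the listing above.
files <- c(
  "ghrp-covid19-en.pdf",
  "sitrep_n27_du_24_03_20.pdf",
  "sitrep_n_32_covid-19_du_29_mars_2020_0.pdf",
  "sitrep_ndeg17_du_14_03_20.pdf",
  "sitrep_ndeg_44.pdf"
)
sitrep_number(files)

# The most recent sitrep is the one with the highest number.
latest <- files[which.max(sitrep_number(files))]
```

With the real download folder, `files` could come from `list.files("burkina_covid")` instead of the hard-coded vector.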
Here's a piece of code to extract data from the latest pdf:
library(tabulizer)

# Extract the raw text of the latest sitrep and split it into lines
res <- tabulizer::extract_text("burkina_covid/sitrep_ndeg_44.pdf")
res <- strsplit(res, "\n")[[1]]

# Keep the digits that follow ": " on each line matching `txt`
num_extr <- function(res, txt) {
  gsub(
    "[^:]*: ([0-9]*).*",
    "\\1",
    grep(txt, res, value = TRUE)
  )
}

# Labels of the contact tracing indicators, as written in the PDF
cont <- c(
  "Cumul personnes contacts listées",
  "Contacts confirmés COVID-19 depuis le début",
  "Nbre de contacts sortis de suivi ce jours",
  "Cumul de contacts sortis après 14 jours de suivis",
  "Nombre de contacts à suivre",
  "Nombre de contacts vus",
  "Nombre de contacts non vus",
  "Nombre de contacts devenus suspects",
  "Nombre de nouveaux contacts"
)

x <- sapply(cont, function(x) num_extr(res, x))
tibble::rownames_to_column(as.data.frame(x), "type")
type x
1 Cumul personnes contacts listées 2409
2 Contacts confirmés COVID-19 depuis le début 272
3 Nbre de contacts sortis de suivi ce jours 31
4 Cumul de contacts sortis après 14 jours de suivis 1076
5 Nombre de contacts à suivre 1061
6 Nombre de contacts vus 1
7 Nombre de contacts non vus 19
8 Nombre de contacts devenus suspects 10
9 Nombre de nouveaux contacts 99
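If this gets run over several sitreps, some labels may be missing or worded differently in a given PDF. Below is a rough sketch of a more defensive variant that returns `NA` when a label is not found and numeric values otherwise. `extract_counts()` is a hypothetical name, and the input lines are invented for illustration; the real sitrep text may be laid out differently.

```r
# Hypothetical variant of num_extr(): one row per label, NA if absent.
extract_counts <- function(lines, labels) {
  values <- vapply(labels, function(label) {
    # fixed = TRUE treats the label as plain text, not as a regex
    hit <- grep(label, lines, fixed = TRUE, value = TRUE)
    if (length(hit) == 0) return(NA_integer_)
    # Take the first run of digits after the first colon of the first match
    m <- regmatches(hit[1], regexec(":[^0-9]*([0-9]+)", hit[1]))[[1]]
    if (length(m) < 2) NA_integer_ else as.integer(m[2])
  }, integer(1), USE.NAMES = FALSE)
  data.frame(type = labels, value = values, stringsAsFactors = FALSE)
}

# Invented lines, for illustration only.
lines <- c(
  "Nombre de contacts vus : 12",
  "Nombre de contacts non vus : 3",
  "Some unrelated line"
)
counts <- extract_counts(
  lines,
  c(
    "Nombre de contacts vus",
    "Nombre de contacts non vus",
    "Nombre de nouveaux contacts"
  )
)
counts
```

The third label is absent from the invented lines, so its value comes back as `NA` instead of silently dropping the row.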
I'm French, so these seem to be the interesting parts to me, but I'm no expert in the field, so it would be great to have someone with domain knowledge point me to the relevant parts of the PDF.
New confirmed cases and deaths by district would be great. But there doesn't seem to be any pattern in the way this information is given in the PDF (unlike the contact tracing, "suivi des contacts", section above), so I'm guessing it would be difficult to scrape consistently.
Here's an attempt at a package to download and scrape the data: https://github.com/ColinFay/covidbf
Let me know if you want me to work more on this.
@ColinFay thanks a lot. As mentioned by @PaulC91, the information you are scraping comes from the contact tracing section of the reports. The new cases per region or per district are buried in the text and do not seem to be reported consistently. There is also the map at the beginning that gives new cases by district, but I believe it would be very hard to scrape.
Just to check, have you tried contacting the people listed at the bottom of the pdf? They might be willing to share the data
Oh and, what's LMIC?
Low- and middle-income country
@ColinFay yes, we are in contact with authorities.
I added some new countries and potential data sources to the spreadsheet.
See here for details: https://github.com/epiforecasts/NCoVUtils/issues/72#issuecomment-620715775