covid19germany

Data download is unreliable and sometimes (!) yields incomplete data

Open arne1921KF opened this issue 3 years ago • 6 comments

Today (2020-11-01), timeseries data downloaded via the usual get_RKI_timeseries() with the default parameter url = "https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv" contains only some data from Hamburg, Schleswig-Holstein and Niedersachsen.

The page https://hub.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0 says that the download options are currently being changed, and that https://www.arcgis.com/home/item.html?id=f10774f1c63e40168479a1feb6c7ca74 should be used for now.

The download link there is currently hidden behind the links/buttons on the page.

arne1921KF • Nov 01 '20 15:11

@stschiff already observed a similar issue last week. It resolved itself overnight. Maybe we will have to switch to the alternative download option eventually, but for now I suggest waiting once more.

nevrome • Nov 01 '20 15:11

So right now it seems to work again:

> rki_timeseries <- get_RKI_timeseries()
> unique(rki_timeseries$Bundesland)
 [1] "Brandenburg"            "Bayern"                
 [3] "Niedersachsen"          "Nordrhein-Westfalen"   
 [5] "Baden-Württemberg"      "Saarland"              
 [7] "Rheinland-Pfalz"        "Schleswig-Holstein"    
 [9] "Hessen"                 "Hamburg"               
[11] "Bremen"                 "Sachsen"               
[13] "Thüringen"              "Berlin"                
[15] "Mecklenburg-Vorpommern" "Sachsen-Anhalt" 
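
The manual check above could be automated so that an incomplete download fails loudly instead of silently producing partial results. This is only a minimal sketch: `expected_states`, `missing_states()` and `check_complete()` are hypothetical names and not part of covid19germany; the sketch assumes the `Bundesland` column and the state spellings shown in the output above.

```r
# The 16 German federal states, spelled as in the console output above.
expected_states <- c(
  "Baden-Württemberg", "Bayern", "Berlin", "Brandenburg", "Bremen",
  "Hamburg", "Hessen", "Mecklenburg-Vorpommern", "Niedersachsen",
  "Nordrhein-Westfalen", "Rheinland-Pfalz", "Saarland", "Sachsen",
  "Sachsen-Anhalt", "Schleswig-Holstein", "Thüringen"
)

# Hypothetical helper: which expected states are absent from the table?
missing_states <- function(data, expected = expected_states) {
  setdiff(expected, unique(data$Bundesland))
}

# Hypothetical helper: stop with an informative error on incomplete data,
# otherwise return the data invisibly so the call can be chained.
check_complete <- function(data) {
  missing <- missing_states(data)
  if (length(missing) > 0) {
    stop("Download incomplete, missing: ", paste(missing, collapse = ", "))
  }
  invisible(data)
}

# Intended use (needs network access, so not run here):
# rki_timeseries <- check_complete(get_RKI_timeseries())
```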

nevrome • Nov 02 '20 18:11

....and gone again. Now they changed something in the data itself, it seems. I get parsing failures. Looks like the date columns changed. That breaks your code.
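
One way to make the parsing less brittle against such changes might be to try several date formats and keep the one that parses the most values. This is a rough sketch only: `parse_date_flexible()` is a hypothetical helper, and the candidate formats are guesses, since I have not pinned down what exactly changed in the date columns.

```r
# Hypothetical helper: parse a vector of date strings by trying several
# candidate formats and keeping the format that parses the most values.
# The default formats are assumptions, not the confirmed RKI formats.
parse_date_flexible <- function(x,
                                formats = c("%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y")) {
  x <- as.character(x)
  # One parse attempt per format; values that do not match become NA.
  candidates <- lapply(formats, function(f) as.Date(x, format = f))
  n_parsed <- vapply(candidates, function(d) sum(!is.na(d)), integer(1))
  best <- candidates[[which.max(n_parsed)]]
  # Warn about values that were present but could not be parsed.
  n_failed <- sum(is.na(best) & !is.na(x))
  if (n_failed > 0) {
    warning(n_failed, " value(s) could not be parsed as dates")
  }
  best
}
```

Trailing time-of-day parts such as " 00:00:00" are ignored, because the format string only consumes the date part of each value.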

I hate it when data providers do this.

arne1921KF • Nov 04 '20 08:11

Hm - can't confirm right now. Seems to work again.

But I get the feeling this download feature breaks multiple times a day. Maybe it's because the file grew to >55 MB and the way we download it is just not suitable any more.

Maybe we should copy it automatically to an extra branch here on GitHub once a day and point the default path of get_RKI_timeseries() to our mirror.
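
On the client side, the idea could look roughly like the sketch below: try a list of sources in order and accept the first one that passes a validity check. `read_first_valid()` is a hypothetical helper, not an existing covid19germany function, and the mirror URL in the usage comment is a placeholder, since no mirror exists yet.

```r
# Hypothetical helper: try each URL in turn and return the first table that
# passes the `is_valid` check. `read_fun` can be swapped for another reader.
read_first_valid <- function(urls, is_valid, read_fun = utils::read.csv) {
  for (url in urls) {
    # A failed download or parse counts as "no data" for this source.
    result <- tryCatch(read_fun(url), error = function(e) NULL)
    if (!is.null(result) && isTRUE(is_valid(result))) {
      return(result)
    }
    message("No valid data from ", url, ", trying next source")
  }
  stop("None of the sources returned valid data")
}

# Intended use (needs network access, so not run here; the second URL is a
# placeholder for a mirror that does not exist yet):
# data <- read_first_valid(
#   urls = c(
#     "https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv",
#     "https://example.org/mirror/RKI_COVID19.csv"
#   ),
#   is_valid = function(d) length(unique(d$Bundesland)) == 16
# )
```

For a file of this size it may also help to raise R's `timeout` option (60 seconds by default) before downloading, e.g. `options(timeout = 300)`.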

nevrome • Nov 04 '20 20:11

Aaaaand dead again. Only Schleswig-Holstein is present in the timeseries. It was already like this at 5 am, when my bot tried to pull the current data, and it is still the case at 9 am.

A git repository of the data would be rad. I seriously would like to know why the RKI isn't doing this themselves: just pushing the data to GitHub as soon as it is in. That way, changes to the dataset could even be tracked transparently through version control.

arne1921KF • Nov 10 '20 08:11

I merged #34 now to permanently enable the download from the alternative source. This seems to be more reliable.

nevrome • Nov 16 '20 09:11