Why is the "Change from previous day" number reported by the RKI not equal to the last entry in "NumberNewTestedIll"?

Open slawomirmatuszak opened this issue 3 years ago • 6 comments

Yesterday I noticed that, after summarising your dataset, it showed only 1300 new cases, while the RKI dashboard showed over 2000. Today it is the same: 961 new cases (RKI shows 2279) and no new deaths (RKI shows 2). However, the cumulative numbers appear to be correct (same as on the dashboard). Interestingly, when I filter the data to the previous day, it looks as if that day has been updated: instead of 1300 cases, yesterday now shows 2353. Is there a reason for this?

slawomirmatuszak, Oct 04 '20 10:10

Oh - maybe the data structure provided by the RKI changed yet again? Please show me your code. Maybe I can figure out where the problem is coming from.

nevrome, Oct 04 '20 10:10

Code:

library(tidyverse)
library(covid19germany)

df <- get_RKI_timeseries()
max.date <- group_RKI_timeseries(df) %>%
  tail()
max.date

As you can see, the number of new cases for 3rd October is only 961. It is not a problem with your function: when I summarise using dplyr, I get the same results. Yesterday it was similar: there were 1300 new cases for 2nd October. But today that figure appears to have been updated (2353). The cumulative figures appear to be correct.

slawomirmatuszak, Oct 04 '20 11:10

There are multiple ways to calculate the number of new cases and the RKI updates their dataset for past days as well. It's a pretty confusing dataset, honestly.

If you go here and click on More, you get a description of the raw dataset and its columns (in German). get_RKI_timeseries() yields a slightly simplified version of this. I'm not sure, though, why "our" numbers for the current day lag behind the RKI data. It probably has something to do with the reported dates, where the dataset distinguishes between "Meldedatum", "Referenzdatum" and "Erkrankungsdatum".

You can download the raw version of the dataset with get_RKI_timeseries(raw_only = T). Maybe you can figure out what causes this difference. I will take a look as well.

nevrome, Oct 04 '20 12:10

Ha - I think I understand it now. This sentence from the German version of the daily RKI report is crucial:

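Die Differenz zum Vortag bezieht sich auf Fälle, die dem RKI täglich übermittelt werden. Dies beinhaltet Fälle, die am gleichen Tag oder bereits an früheren Tagen an das Gesundheitsamt gemeldet worden sind.

(In English: The difference from the previous day refers to cases that are transmitted to the RKI daily. This includes cases that were reported to the local health authority on the same day or already on earlier days.)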

That means the RKI reports what it learns from the local health authorities. This data might be from previous days, so it is counted towards these previous days in the "Meldedatum" column. To calculate the "Change from previous day" number the RKI shows in the dashboard and the reports we would need an additional column "Date when reported from the local health authorities to the RKI".
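To make the mechanism concrete, here is a minimal sketch in base R with invented numbers (not RKI data). The column `TransmittedToRKI` is hypothetical: it stands in for the "Date when reported from the local health authorities to the RKI" that the published dataset does not contain. Under that assumption, grouping by "Meldedatum" and grouping by the transmission date give different values for the latest day, even though the totals agree.

```r
# Toy data with invented numbers. "TransmittedToRKI" is a hypothetical
# column; the published RKI dataset does not contain it.
cases <- data.frame(
  Meldedatum       = as.Date(c("2020-10-01", "2020-10-02", "2020-10-02", "2020-10-03")),
  TransmittedToRKI = as.Date(c("2020-10-03", "2020-10-02", "2020-10-03", "2020-10-03")),
  AnzahlFall       = c(400, 1300, 1000, 900)
)

# What summing by Meldedatum yields (the approach used in this thread)
by_meldedatum <- aggregate(AnzahlFall ~ Meldedatum, data = cases, FUN = sum)

# What a dashboard-style "change from previous day" would count
by_transmission <- aggregate(AnzahlFall ~ TransmittedToRKI, data = cases, FUN = sum)

latest <- as.Date("2020-10-03")
meldedatum_latest <- by_meldedatum$AnzahlFall[by_meldedatum$Meldedatum == latest]
dashboard_change  <- by_transmission$AnzahlFall[by_transmission$TransmittedToRKI == latest]

print(meldedatum_latest)  # 900: only cases whose Meldedatum is the latest day
print(dashboard_change)   # 2300: everything that arrived at the RKI that day

# The cumulative totals agree either way
print(sum(by_meldedatum$AnzahlFall) == sum(by_transmission$AnzahlFall))
```

In this toy example the value for 2nd October also grows after the fact (from 1300 to 2300) once the late-arriving cases are included, which matches the pattern described above.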

nevrome, Oct 04 '20 13:10

I think your explanation is most likely correct. I tried to compare your dataset, the raw data from your package, and the data from arcgis, each aggregated by date.

```r
library(tidyverse)
library(lubridate)
library(covid19germany)

# data from your package
df <- covid19germany::get_RKI_timeseries()
grouped <- group_RKI_timeseries(df) %>%
  arrange(desc(Date)) %>%
  head()

# data from arcgis
df2 <- read_csv("https://opendata.arcgis.com/datasets/dd4580c810204019a7b8eb3e0b329dd6_0.csv")
arcgis <- df2 %>%
  mutate(Meldedatum = ymd_hms(Meldedatum)) %>%
  group_by(Meldedatum) %>%
  summarise(AnzahlFall = sum(AnzahlFall)) %>%
  mutate(AnzahlFall.cum = cumsum(AnzahlFall)) %>%
  arrange(desc(Meldedatum)) %>%
  head()

# raw data from your package
raw.data <- get_RKI_timeseries(raw_only = T)

# grouped by Refdatum
Refdatum <- raw.data %>%
  group_by(Refdatum) %>%
  summarise(AnzahlFall = sum(AnzahlFall)) %>%
  mutate(AnzahlFall.cum = cumsum(AnzahlFall)) %>%
  arrange(desc(Refdatum)) %>%
  head()

# grouped by Meldedatum
Meldedatum <- raw.data %>%
  group_by(Meldedatum) %>%
  summarise(AnzahlFall = sum(AnzahlFall)) %>%
  mutate(AnzahlFall.cum = cumsum(AnzahlFall)) %>%
  arrange(desc(Meldedatum)) %>%
  head()
```

The new daily cases for the latest day are always off compared with the dashboard. Interestingly, the cumulative number of cases from the raw data and from the arcgis data is incorrect, while the figure from your dataset is the same as on the RKI dashboard.

slawomirmatuszak, Oct 05 '20 13:10

Alright - good that you did this test. For the cumulative number you have to consider the encoding in the NeuerFall column. Maybe this causes the difference between your code and what the package and the dashboard report.
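A rough sketch of what considering the NeuerFall encoding could look like, in base R with invented numbers. It assumes my reading of the RKI column description is right: NeuerFall is 0 for cases contained in both today's and yesterday's publication, 1 for cases only in today's publication, and -1 for cases only in yesterday's publication (these rows carry a negative AnzahlFall). Please check this against the official description before relying on it.

```r
# Toy rows with invented numbers; the NeuerFall semantics are an assumption
# based on the RKI column description (0 = in both publications,
# 1 = only in today's, -1 = only in yesterday's, with negative AnzahlFall).
raw <- data.frame(
  AnzahlFall = c(10, 3, 5, -2),
  NeuerFall  = c(0, 0, 1, -1)
)

# Naive cumulative count: sums every row, including the negative correction
naive_total <- sum(raw$AnzahlFall)

# Cases in the current publication: only rows with NeuerFall 0 or 1
current_total <- sum(raw$AnzahlFall[raw$NeuerFall %in% c(0, 1)])

# Difference from the previous publication: rows with NeuerFall -1 or 1
change_from_previous <- sum(raw$AnzahlFall[raw$NeuerFall %in% c(-1, 1)])

print(naive_total)           # 16
print(current_total)         # 18
print(change_from_previous)  # 3
```

If that reading holds, a plain sum(AnzahlFall) over the raw data undercounts the cumulative total by the corrected cases, and the rows with NeuerFall in (-1, 1) might give the day-over-day difference directly.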

nevrome, Oct 05 '20 16:10