covid_age
US New Jersey Data
In Output_10.csv, for the state of New Jersey, I can see several weeks in a row with the same records, day after day. Does this indicate zero new cases in that period/region, or a lack of new data?
I did this:
> library(tidyverse)
> NJ <- read.csv(
+ file = "Output_10_20200901.csv",
+ header = TRUE,
+ skip = 3
+ ) %>%
+ filter(Region == "New Jersey",
+ Sex == "b") %>%
+ mutate(
+ Date = as.Date(Date, format = "%d.%m.%Y"))
>
>
> NJ %>% select(Region, Date, Age, Cases) %>%
+ filter(Date > "2020-08-01") %>%
+ pivot_wider(names_from = Date, values_from = Cases)
# A tibble: 11 x 24
Region Age `2020-08-02` `2020-08-03` `2020-08-04` `2020-08-05` `2020-08-06` `2020-08-07` `2020-08-08`
<fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 New J~ 0 1967. 1967. 1967. 1967. 1967. 1967. 1967.
2 New J~ 10 4962. 4962. 4962. 4962. 4962. 4962. 4962.
3 New J~ 20 23854. 23854. 23854. 23854. 23854. 23854. 23854.
4 New J~ 30 31052. 31052. 31052. 31052. 31052. 31052. 31052.
5 New J~ 40 26123 26123 26123 26123 26123 26123 26123
6 New J~ 50 34429. 34429. 34429. 34429. 34429. 34429. 34429.
7 New J~ 60 26415. 26415. 26415. 26415. 26415. 26415. 26415.
8 New J~ 70 14328. 14328. 14328. 14328. 14328. 14328. 14328.
9 New J~ 80 9587. 9587. 9587. 9587. 9587. 9587. 9587.
10 New J~ 90 8184. 8184. 8184. 8184. 8184. 8184. 8184.
11 New J~ 100 70.2 70.2 70.2 70.2 70.2 70.2 70.2
# ... with 15 more variables: `2020-08-09` <dbl>, `2020-08-10` <dbl>, `2020-08-11` <dbl>,
# `2020-08-12` <dbl>, `2020-08-13` <dbl>, `2020-08-14` <dbl>, `2020-08-19` <dbl>, `2020-08-20` <dbl>,
# `2020-08-21` <dbl>, `2020-08-22` <dbl>, `2020-08-23` <dbl>, `2020-08-24` <dbl>, `2020-08-25` <dbl>,
# `2020-08-26` <dbl>, `2020-08-27` <dbl>
I spot-checked the screenshots of the dashboard where the data were captured. Indeed, for a series of days in early August (exact range TBD) the "last update" note was stuck at July 30. We will amend this in the input data; in effect, it will turn out to be a calendar gap in the data. This is one of the dashboards that are captured manually, so it is an easily understandable human error.
One could, however, interpolate the age fractions between the left and right edges of the gap and rescale them to the crude totals, which are more likely to be available. So far we are not doing that sort of thing in the database, but it is among the options that may be forthcoming. Your opinion on the matter would be most welcome.
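To make the idea concrete, here is a minimal sketch of that fraction-interpolation-plus-rescaling approach. The data and the crude total are toy values, not from the database, and the column names are just illustrative; the point is only the mechanics: interpolate each age group's share of cases between the bracketing dates, renormalise, and scale to the known crude total.

```r
# Sketch (not COVerAGE-DB code): for a date inside a gap, linearly
# interpolate each age group's *fraction* of cases between the last
# date before the gap (left) and the first date after it (right),
# then rescale to the known crude total for the missing date.
library(dplyr)

# toy data: cumulative cases by age on the two dates bracketing the gap
left  <- tibble(Age = c(0, 25, 50, 75), Cases = c(10, 40, 30, 20))  # e.g. 2020-08-10
right <- tibble(Age = c(0, 25, 50, 75), Cases = c(12, 60, 45, 33))  # e.g. 2020-08-17
crude_total_mid <- 120  # hypothetical crude total reported for 2020-08-13

w <- 3 / 7  # relative position of the missing date within the gap

mid <- left %>%
  inner_join(right, by = "Age", suffix = c("_l", "_r")) %>%
  mutate(
    frac_l = Cases_l / sum(Cases_l),
    frac_r = Cases_r / sum(Cases_r),
    frac   = (1 - w) * frac_l + w * frac_r,  # interpolated age fraction
    frac   = frac / sum(frac),               # renormalise to sum to 1
    Cases  = frac * crude_total_mid          # rescale to the crude total
  ) %>%
  select(Age, Cases)

mid
```

By construction the interpolated age distribution sums to the crude total, so the age structure drifts smoothly across the gap while the marginal total stays consistent with the (more reliable) crude series.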
In my analyses I also apply linear interpolation when the gap of missing data is small (1 or 2 time intervals). Since I am mostly working with weekly data now, that means 1 or 2 weeks. If the gap is larger, a different kind of interpolation might be more appropriate, provided that there is sufficient data both before AND after the gap. As a rule, we could define a maximum gap length up to which interpolation is applied. Otherwise, I consider it more appropriate to simply acknowledge the missing information and leave it up to the user to model it.
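The small-gap rule above can be sketched with `zoo::na.approx()`, whose `maxgap` argument does exactly this: gaps up to the threshold are filled by linear interpolation, longer runs of missing values are left as NA for the user to model. The series below is a toy weekly example, not real data.

```r
# Sketch: fill gaps of at most `maxgap` consecutive missing weeks by
# linear interpolation; longer gaps are deliberately left as NA.
library(zoo)

cases  <- c(100, 110, NA, 130, NA, NA, NA, 200)  # toy weekly series
filled <- na.approx(cases, maxgap = 2, na.rm = FALSE)
filled
# the single-week gap is filled (120); the three-week gap stays NA
```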