incidence2
Document how to handle grouped line list columns
It is unclear from the current {incidence2} documentation whether the package, specifically incidence2::incidence(), can handle grouped columns.
An example of such a line list with a grouped column is the Ebola simulated line list in {outbreaks}.
head(outbreaks::ebola_sim_clean$linelist)
#>   case_id generation date_of_infection date_of_onset date_of_hospitalisation
#> 1  d1fafd          0              <NA>    2014-04-07              2014-04-17
#> 2  53371b          1        2014-04-09    2014-04-15              2014-04-20
#> 3  f5c3d8          1        2014-04-18    2014-04-21              2014-04-25
#> 4  6c286a          2              <NA>    2014-04-27              2014-04-27
#> 5  0f58c4          2        2014-04-22    2014-04-26              2014-04-29
#> 6  49731d          0        2014-03-19    2014-04-25              2014-05-02
#>   date_of_outcome outcome gender           hospital       lon      lat
#> 1      2014-04-19    <NA>      f  Military Hospital -13.21799 8.473514
#> 2            <NA>    <NA>      m Connaught Hospital -13.21491 8.464927
#> 3      2014-04-30 Recover      f              other -13.22804 8.483356
#> 4      2014-05-07   Death      f               <NA> -13.23112 8.464776
#> 5      2014-05-17 Recover      f              other -13.21016 8.452143
#> 6      2014-05-07    <NA>      f               <NA> -13.23443 8.468572
Created on 2024-04-17 with reprex v2.1.0
So far I've been reshaping line list data like this with tidyr::pivot_wider() before passing it to incidence(), so that each outcome gets its own date column.
library(magrittr)
outbreaks::ebola_sim_clean$linelist %>%
  tidyr::pivot_wider(
    names_from = outcome,
    values_from = date_of_outcome
  )
#> # A tibble: 5,829 × 12
#>    case_id generation date_of_infection date_of_onset date_of_hospitalisation
#>    <chr>        <int> <date>            <date>        <date>
#>  1 d1fafd           0 NA                2014-04-07    2014-04-17
#>  2 53371b           1 2014-04-09        2014-04-15    2014-04-20
#>  3 f5c3d8           1 2014-04-18        2014-04-21    2014-04-25
#>  4 6c286a           2 NA                2014-04-27    2014-04-27
#>  5 0f58c4           2 2014-04-22        2014-04-26    2014-04-29
#>  6 49731d           0 2014-03-19        2014-04-25    2014-05-02
#>  7 f9149b           3 NA                2014-05-03    2014-05-04
#>  8 881bd4           3 2014-04-26        2014-05-01    2014-05-05
#>  9 e66fa4           2 NA                2014-04-21    2014-05-06
#> 10 20b688           3 NA                2014-05-05    2014-05-06
#> # ℹ 5,819 more rows
#> # ℹ 7 more variables: gender <fct>, hospital <fct>, lon <dbl>, lat <dbl>,
#> #   `NA` <date>, Recover <date>, Death <date>
Created on 2024-04-17 with reprex v2.1.0
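To make the reshaping step easier to follow, here is a minimal sketch on a made-up three-row line list (the `toy` data frame and its values are hypothetical, not taken from {outbreaks}). It only assumes that {tidyr} is installed.

```r
library(tidyr)

# Hypothetical toy line list: one row per case, with a single date column
# whose meaning depends on the value in `outcome`.
toy <- data.frame(
  case_id         = c("a", "b", "c"),
  date_of_onset   = as.Date(c("2014-04-07", "2014-04-15", "2014-04-21")),
  date_of_outcome = as.Date(c("2014-04-19", "2014-04-30", "2014-05-07")),
  outcome         = c("Death", "Recover", "Death")
)

# Spread the single date column into one date column per outcome value.
wide <- pivot_wider(toy, names_from = outcome, values_from = date_of_outcome)

# Each case keeps its row; the date lands in the column matching its outcome
# and the other outcome column is NA.
names(wide)
wide
```

In the real data, cases with a missing outcome end up in a column literally named `NA`, as the tibble output above shows.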
These columns can then be selected using the date_index argument in incidence().
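library(incidence2)

# `linelist` is the pivoted tibble from the previous step
linelist <- outbreaks::ebola_sim_clean$linelist %>%
  tidyr::pivot_wider(names_from = outcome, values_from = date_of_outcome)

daily <- incidence(
  linelist,
  date_index = c(
    onset = "date_of_onset",
    death = "Death"
  ),
  interval = "daily"
)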
It would be great either to document the best way to work with this kind of data somewhere in the {incidence2} package, or to add functionality to handle it.
This issue might also link with the {linelist} package: is there a line list standard from that package, and are any of its columns grouped? If so, an as_incidence.linelist() S3 method might be beneficial. @Bisaloo is this the case or are all {linelist} tags for ungrouped columns?
Cheers @joshwlambert. Yes, this is exactly how I'd handle it (outside of incidence). Will add an example along the lines of:
outbreaks::ebola_sim_clean$linelist |>
  pivot_wider(names_from = outcome, values_from = date_of_outcome) |>
  incidence(
    date_index = c(
      onset = "date_of_onset",
      hospitalisation = "date_of_hospitalisation",
      death = "Death"
    ),
    interval = "daily"
  )
The issue we have is that incidence2 expects wide data (albeit potentially aggregated) whereas this is a mixture of wide and long. Whilst it may be possible to adapt it to allow for long-style "outcome" (and associated date) columns, I think the tidyr approach is so elegant that my preference is just to ensure it is documented.
As you allude to, we could do a lot with an as_incidence.linelist(), assuming it handles this wide/long mixture in its specification - @Bisaloo?
As an aside, I may also add a grates version to the examples to illustrate the different approaches.
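For illustration, a grates-based version might look roughly like the sketch below. This is not the example the maintainer has in mind, just one plausible shape for it: it assumes {dplyr} and {grates} are installed, uses grates::as_isoweek() to bin dates into ISO weeks, and runs on a hypothetical `toy` data frame rather than the Ebola line list.

```r
library(dplyr)
library(grates)

# Hypothetical toy data: three onset dates, two of them in the same ISO week.
toy <- data.frame(
  date_of_onset = as.Date(c("2014-04-07", "2014-04-08", "2014-04-15"))
)

# Convert each date to its ISO week, then count cases per week.
weekly <- toy |>
  mutate(week = as_isoweek(date_of_onset)) |>
  count(week, name = "cases")

weekly
```

Here the aggregation is spelled out with dplyr verbs instead of being wrapped in incidence(), which is the trade-off between the two approaches.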
> As you allude to, we could do a lot with an as_incidence.linelist(), assuming it handles this wide/long mixture in its specification - @Bisaloo?
Couple of thoughts on this:
- I've been considering it on multiple occasions but I'm still not convinced this should be an as_incidence() method. My view of as_() methods is that they convert objects between two different but equivalent formats. But line list data and aggregated count data are two very different objects. Put differently, in most cases it should be possible to do the class round-trip in a (quasi-)transparent way, which would not be the case here as information is lost when aggregating the data. But maybe this is just a terminology issue.
- Would we actually be able to do much more than with the default incidence() function? E.g., in the case of a column tagged with date_outcome, are we sure that users will always prefer to pivot and convert it to date_death + date_recovery? Or can we imagine that they would be happy to pass date_outcome directly to incidence()? In other words, I see value for such a function only if we can provide more informative defaults than for standard data.frames. Is that really the case here?
> I've been considering it on multiple occasions but I'm still not convinced this should be an as_incidence() method. My view of as_() methods is that they convert objects between two different but equivalent formats. But line list data and aggregated count data are two very different objects. Put differently, in most cases it should be possible to do the class round-trip in a (quasi-)transparent way, which would not be the case here as information is lost when aggregating the data. But maybe this is just a terminology issue.
Interesting. I've always viewed as_ methods as (potentially lossy) casts (e.g. as.integer(1.5)). Alternatively you could make the incidence() function generic but that feels less satisfying (although I cannot put my finger on why).
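The two views of as_ methods can be seen in base R itself. The snippet below is only an illustration of the as.integer(1.5) example: the cast succeeds, but converting back does not recover the original value.

```r
x <- 1.5

# The cast drops the fractional part.
y <- as.integer(x)
y

# Round-tripping back to double does not restore the original value,
# so this conversion is lossy rather than a change between equivalent formats.
back <- as.numeric(y)
identical(back, x)
```

Aggregating a line list into counts is lossy in the same sense: the case-level rows cannot be rebuilt from the counts.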
> Would we actually be able to do much more than with the default incidence() function? E.g., in the case of a column tagged with date_outcome, are we sure that users will always prefer to pivot and convert it to date_death + date_recovery? Or can we imagine that they would be happy to pass date_outcome directly to incidence()? In other words, I see value for such a function only if we can provide more informative defaults than for standard data.frames. Is that really the case here?
These are the interesting questions. I think it very much depends on what a typical input linelist (in the non-package sense) looks like. incidence only really handles wide, potentially aggregated, data as this was, in essence, what I inherited spec-wise. If data generally has this form with additional "outcome" and "date of outcome" columns we could perhaps adapt, but I'm loath to do so without more of a formal spec of a linelist.
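For comparison, passing the outcome date column directly, without pivoting, could look like the sketch below. It assumes incidence2 version 2.x, where incidence() has a `groups` argument, and it uses a hypothetical `toy` data frame. It is not a proposal from the thread, only an illustration of the "pass date_outcome directly" option raised above.

```r
library(incidence2)

# Hypothetical toy data: one outcome date per case plus the outcome label.
toy <- data.frame(
  date_of_outcome = as.Date(c("2014-04-19", "2014-04-19", "2014-04-30")),
  outcome         = c("Death", "Recover", "Death")
)

# Keep the data long: count outcome dates, stratified by the outcome label.
by_outcome <- incidence(
  toy,
  date_index = "date_of_outcome",
  groups     = "outcome",
  interval   = "daily"
)

by_outcome
```

This keeps a single count variable split by outcome, whereas the pivoted version yields one count variable per outcome (e.g. "death") that sits alongside "onset" on the same footing.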
An aside: my gut feeling is that incidence2 is currently a package without a reason. I'm OK with this but think it's important to be open here. I tend to push people towards dplyr/data.table in combination with grates as this is more aligned with how I approach things. It could be useful if a range of methods (e.g. models for trend fitting) were incorporated in incidence2 (calling functions from suggested packages) so people could easily go:
data -> incidence2 -> model fits by groups/counts.
but after 2 to 3 years I don't think there is a desire for this and unless a specific need comes up in ${DAYJOB} it's not something I'll push for.