incidence
Restructure incidence object as data frame to allow for facetting
This is a tricky one, since we aggregate data into the $counts matrix, but it would be nice to be able to combine plotting incidence objects with a group for filling and facet_grid() using other criteria. It would look something like:
ggplot(x) + geom_histogram(aes(x = dates, fill = aaa)) + facet_grid(bbb ~ ccc)
where aaa, bbb and ccc are factors.
This may need some rethinking of our internal data representation, so I appreciate it may be a mid-to-long term change.
This is related to #76
As I mentioned in #76, $counts could be an array, but yes, that would take a LOT of refactoring.
In the meantime, I could write a vignette on how to work through this with the tidyverse to address issues like #86
This would be very useful. Wanted to make some epicurves yesterday and used ggplot as couldn't facet with incidence.
Note that this was brought up in Ben Bolker's review of the incidence paper:
The main use of the package is for converting from line lists to aggregated incidence data. It would be useful (I can't tell if it's possible) to easily be able to aggregate data that are already in date/incidence form to coarser scales, or to approximately disaggregate incidence data.
My initial thoughts on this:
[this] would require re-structuring the incidence class OR creating a new class (flexincidence?) that could simply just store the dates, intervals, and grouping info and generate incidence objects on the fly (I think Jun Cai suggested something like this sometime in the past).
Thinking about it, this new object can simply inherit from a data frame so that it can easily be plugged into existing data manipulation architecture. We would need to store the aggregation information as attributes (date column, interval, date range, and grouping), but that shouldn't be too difficult.
The user should be able to use the accessors to get the dates, range, counts, etc. One of the challenges will be how to represent faceted counts when someone wants to use get_counts(). We could return an array (as suggested above), but that may be overkill when all the user would really need is an aggregated long data frame.
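To make the data-frame idea a bit more concrete, here is a minimal base R sketch. The names flex_incidence() and get_counts_long(), the class name, and the attribute names are all hypothetical and not part of the package; the sketch also assumes a fixed interval given in days.

```r
# Hypothetical constructor: keep the line list as a data frame and record the
# aggregation settings as attributes instead of pre-computing a counts matrix.
flex_incidence <- function(x, date_col, interval = 7L, groups = character(0)) {
  stopifnot(is.data.frame(x), inherits(x[[date_col]], "Date"))
  attr(x, "date_col")   <- date_col
  attr(x, "interval")   <- interval   # bin width in days (assumption)
  attr(x, "groups")     <- groups
  attr(x, "date_range") <- range(x[[date_col]], na.rm = TRUE)
  class(x) <- c("flex_incidence", class(x))
  x
}

# Hypothetical accessor: aggregate on the fly into a long data frame with one
# row per observed bin/group combination.
get_counts_long <- function(x) {
  d     <- x[[attr(x, "date_col")]]
  start <- attr(x, "date_range")[1]
  width <- attr(x, "interval")
  # Left edge of the bin each date falls into.
  bin <- start + (as.integer(d - start) %/% width) * width
  by  <- c(list(dates = bin), unclass(x)[attr(x, "groups")])
  aggregate(list(n = rep(1L, length(bin))), by = by, FUN = sum)
}

# Made-up line list for illustration.
ll <- data.frame(
  onset = as.Date("2020-01-01") + c(0, 1, 6, 7, 8, 20),
  sex   = c("f", "m", "f", "f", "m", "m")
)
x   <- flex_incidence(ll, "onset", interval = 7L, groups = "sex")
res <- get_counts_long(x)
```

Because the result is an ordinary long data frame, it could be passed straight to ggplot() and facetted on any grouping column.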
I just came up with an idea for implementing the facetting functionality for incidence plots in an easy, efficient and elegant way. Why not make the best use of the existing facetting functionality provided by the ggplot2 package? If incidence plots can be regarded as an extension of a ggplot2 Stat or Geom (e.g., geom_line or geom_bar), then the facetting functionality is inherently implemented and already there. Fortunately, since version 2.0.0, ggplot2 has provided the ggproto system to extend ggplot2 by creating a new stat, geom, or theme.
The very first example from the Extending ggplot2 vignette draws the convex hull of a set of points by creating a new stat.
StatChull <- ggproto("StatChull", Stat,
  compute_group = function(data, scales) {
    data[chull(data$x, data$y), , drop = FALSE]
  },
  required_aes = c("x", "y")
)
stat_chull <- function(mapping = NULL, data = NULL, geom = "polygon",
                       position = "identity", na.rm = FALSE, show.legend = NA,
                       inherit.aes = TRUE, ...) {
  layer(
    stat = StatChull, data = data, mapping = mapping, geom = geom,
    position = position, show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}
Test the new stat_chull stat.
ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  stat_chull(fill = NA)
Once the stat_chull stat is created, ggplot2 gives a lot for free, including the facetting functionality. For example,
ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point() +
  stat_chull(fill = NA) +
  facet_grid(. ~ drv)
Therefore, instead of changing the storage structure, we can re-implement the incidence plots as a ggplot2 extension, for instance as a bar geom called geom_incidence(). The graphics and customization features provided by ggplot2, including facetting, then come built in. I believe the amount of code to be rewritten would be much smaller.
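As a rough illustration, and without writing a full ggproto Stat, geom_incidence() could start life as a thin wrapper around geom_histogram(). The function below is hypothetical (it is not in the incidence package), only handles fixed intervals in days, and uses made-up data with invented column names.

```r
library(ggplot2)

# Hypothetical geom_incidence(): a thin wrapper around geom_histogram() that
# bins dates into fixed-width intervals given in days. Not the package's API.
geom_incidence <- function(mapping = NULL, data = NULL, interval = 7, ...) {
  geom_histogram(mapping = mapping, data = data, binwidth = interval, ...)
}

# Synthetic line list, one row per case (made-up data for illustration).
set.seed(1)
ll <- data.frame(
  date_of_onset = as.Date("2020-01-01") + sample(0:60, 200, replace = TRUE),
  gender        = sample(c("f", "m"), 200, replace = TRUE),
  hospital      = sample(c("A", "B"), 200, replace = TRUE)
)

# Fill by one factor, facet by another: facetting comes for free from ggplot2.
p <- ggplot(ll, aes(x = date_of_onset, fill = gender)) +
  geom_incidence(interval = 7, colour = "white") +
  facet_grid(hospital ~ .)
```

A real implementation would presumably need its own Stat to support intervals such as "1 ISO week", which a plain binwidth cannot express.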
I like the idea, but it looks like this would not be using the incidence objects any more, but the raw linelist instead, right? Looks like a slightly different problem from the idea of having an object x which can be used to derive tables of case counts, with a plot(x) that can be facetted.
I think the incidence objects can be kept. We just implement the idea within the existing plot.incidence() and return a ggplot2 object, which can still be extended by chaining further layers. For instance,
myplot <- function(my.obj) {
  # some conversions can be done before plot
  p <- ggplot(my.obj, aes(displ, hwy, colour = drv)) +
    geom_point() +
    stat_chull(fill = NA)
  return(p)
}
myplot(mpg) +
  facet_grid(. ~ drv)
I think @caijun has the right idea here. If we are going to move to a framework where incidence objects are created on the fly, there's no reason why we can't also create ggplot2 geom functions since these would inherently rely on the same underlying architecture to create the incidence object in the first place.
This way people can do something like:
incidence(x, dates = date_of_onset, interval = "1 ISO week", group = gender) %>%
  ggplot() +
  geom_incidence(show_cases = TRUE) + # no extra arguments since the internal data is already an incidence object
  scale_fill_incidence(pal = 1) +
  scale_x_incidence() +
  facet_grid(aaa ~ bbb + ccc)
or
ggplot(x, aes(date_of_onset, fill = gender)) +
  geom_incidence(interval = "1 ISO week", show_cases = TRUE) +
  scale_fill_incidence(pal = 1) +
  scale_x_incidence(interval = "1 ISO week") +
  facet_grid(aaa ~ bbb + ccc)
Theoretically, even adding the fit objects should work (though it will be wonky since the users would have to use the %>% syntax instead of +).
Of course, plot.incidence() should still produce the same plots as it always did, with the difference being that the internals will use the above construct:
incidence(x, dates = date_of_onset, interval = "1 ISO week", group = gender) %>%
  plot(show_cases = TRUE) +
  facet_grid(aaa ~ bbb + ccc)
We've already seen that several users want to do things with the epicurve that can't really be done with the current framework, because the data structure represents an immutable summary. So it makes sense to show people that they can use the plot.incidence() method for a quick visualisation and the ggplot geoms for a more customized interface.
Theoretically, even adding the fit objects should work (though it will be wonky since the users would have to use the %>% syntax instead of +)
%>% and + can be re-defined according to our needs in the incidence package, as they are also objects. Moreover, I like the functions geom_incidence(), scale_fill_incidence() and scale_x_incidence() (used for customizing the x-axis time label).
The more I think about it, the more I'm thinking that we should port the internal functionality to the {tsibble} package. It has everything that we have except for the plotting. All of the steps below can be abstracted away for our users and they can return an incidence object if they want or they can return a tsibble. Our plotting can take care of tsibble objects as well. The only problem is that it will make the incidence package heavier.
library(tsibble)
library(dplyr)
library(aweek)
set_week_start("Saturday")
ll <- outbreaks::ebola_sim_clean$linelist
ll %>%
  as_tsibble(key = case_id, index = date_of_onset) %>%
  index_by(week = ~as.Date(aweek::as.aweek(.))) %>%
  group_by(gender) %>%
  summarize(n = n())
#> # A tsibble: 698 x 3 [1D]
#> # Key: gender [2]
#> gender week n
#> <fct> <date> <int>
#> 1 f 2014-04-07 1
#> 2 f 2014-04-21 1
#> 3 f 2014-04-25 1
#> 4 f 2014-04-26 1
#> 5 f 2014-04-27 1
#> 6 f 2014-05-01 2
#> 7 f 2014-05-03 1
#> 8 f 2014-05-04 1
#> 9 f 2014-05-06 2
#> 10 f 2014-05-07 2
#> # … with 688 more rows
Created on 2019-12-16 by the reprex package (v0.3.0)
Sounds great! I really like the idea of relying on tsibble for the heavy lifting. So in short, an incidence object would essentially be a tsibble with a standardised column for dates and some validated info on interval (possibly stored as an attribute)?
As far as I can tell, re-implementing old features should be easy:
- get_counts(): merely count(dates, a, b)
- get_n(): could be nrow(), or maybe even accept optional grouping count(a, b)
- get_dates(): would become select(dates)
- group_names(): setdiff(names(x), "dates")
- get_interval(): should be easy, only depends on what we do with this (stored as attribute? calculated on the fly?)
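The mapping above could look roughly like this. It is only a sketch under assumptions: the object is a plain data frame with one row per case and a standardised dates column, the interval is stored as an attribute (one of the two options raised), and the example data are made up.

```r
library(dplyr)

# Made-up object: one row per case, a standardised `dates` column, and every
# other column treated as a grouping variable.
x <- data.frame(
  dates  = as.Date("2020-01-06") + c(0, 0, 7, 7, 7, 14),
  gender = c("f", "m", "f", "f", "m", "m")
)
attr(x, "interval") <- "1 week"  # assumption: interval stored as an attribute

# Sketches of the old accessors on top of dplyr / base R.
group_names  <- function(x) setdiff(names(x), "dates")
get_dates    <- function(x) select(x, dates)
get_n        <- function(x) nrow(x)
get_interval <- function(x) attr(x, "interval")
get_counts   <- function(x) {
  # Count cases per date and per grouping column.
  count(x, across(all_of(c("dates", group_names(x)))))
}

counts <- get_counts(x)           # same grouping as count(x, dates, gender)
```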
Importantly, that also means people will be able to use standard dplyr commands if they are more familiar with them (e.g. count rather than get_counts).