Develop consistent approach to Sim SD in print.netdx with 1 sim
The SD values in the `print.netdx` tables originally calculated the SD of the `formation`, `duration`, and `dissolution` data collapsed across all simulations and timesteps. This really wasn't great: for `duration`, the data across time steps aren't independent, and the interpretation is flawed. The issue is especially bad when the durations are imputed, since there is no variation within the imputation. So, when we added the imputation functionality, we switched the SDs to representing only the SD across simulations (with data first averaged across time within simulation). Apparently, though, we (meaning I) only did this for `duration` and `dissolution`, and not for `formation`. I propose that we also do it for `formation`. This will remove some useful information, but ensure consistency.
If we wish to keep the within-sim variation measure in there, then we need to take one of two approaches:

- Rename from Sim SD to SD; add back in the values for `dissolution` for one sim, since these are interpretable; update the docs to explain how the values are now calculated, and that they mean something different for `duration` than for the other two, and why.
- Split out within-sim and across-sim measures into two columns. Always give `duration` an NA for within-sim. Make all within-sims NA when nsteps = 1 and all across-sims NA when nsims = 1. Update the docs.
The latter retains the most info, but is a lot of added work for what is really meant to be a quick-and-dirty glance.
I favor sticking with just the cross-sim SD and removing `formation` values when nsims = 1.
@smjenness @AdrienLeGuillou @chad-klumb @martinamorris opinions?
This is a good question @sgoodreau. Every time I've ventured into this from a research perspective, I've wanted to be able to decompose the overall variation into within and between simulations. I'm not 100% sure how this is done now in `netdx`, so let me first see if I understand correctly. Then maybe I can make an informed suggestion.
Duration
In both cases, we're looking for a dx of how much the mean duration varies. I'm assuming that we're using the mean age of active ties at each timestep (`meanage(t)`) as our estimate of mean duration. So if we want to distinguish within- vs. between-sim variation:

- within-sim, there is a distribution of `meanage(t)` across t, and we calculate `SD_sim(meanage(t))`. To get a single estimate of this within-sim SD across simulations, would you take the `mean(SD_sim)`?
- between-sim, we take the `mean_sim = mean(meanage(t))` for each simulation, and calculate `SD(mean_sim)`.

Is that right?
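A minimal sketch of that decomposition, with `meanage` as a hypothetical nsteps x nsims matrix (one column of `meanage(t)` values per simulation) standing in for actual EpiModel output:

```r
# Hypothetical stand-in: nsteps x nsims matrix of meanage(t), one column per sim.
meanage <- matrix(rnorm(100 * 15, mean = 20, sd = 2), nrow = 100, ncol = 15)

# Within-sim: SD of meanage(t) over t in each sim, then averaged across sims.
sd_within <- mean(apply(meanage, 2, sd))

# Between-sim: per-sim mean of meanage(t), then the SD of those means across sims.
mean_sim   <- colMeans(meanage)
sd_between <- sd(mean_sim)
```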
Formation and Dissolution
These both use proportions as the basic measure: the proportion of empty dyads in which a tie forms, and the proportion of extant ties that dissolve, respectively.
Then the analogous measures to duration can be constructed within and between sims.
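For concreteness, a sketch of those per-step proportions; `ties_t` and `ties_t1` are assumed stand-ins for dyad status at consecutive time steps, not EpiModel internals:

```r
# Illustrative only: logical vectors of dyad status at steps t and t + 1.
set.seed(1)
ties_t  <- runif(1000) < 0.05
ties_t1 <- ifelse(ties_t, runif(1000) > 0.05, runif(1000) < 0.003)

form_prop <- sum(!ties_t & ties_t1) / sum(!ties_t)  # ties formed among empty dyads
diss_prop <- sum(ties_t & !ties_t1) / sum(ties_t)   # extant ties that dissolved
```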
Imputation
I understand how imputation is used in estimation to address missing data. I don't understand how imputation would be used in simulation, so I'm not sure why it is relevant here.
> This is a good question @sgoodreau. Every time I've ventured into this from a research perspective, I've wanted to be able to decompose the overall variation into within and between simulations. I'm not 100% sure how this is done now in `netdx`,

It isn't. Originally the two were lumped together into a single metric. Then, as per issue #274, I rewrote this to ignore the within-sim variation and only calculate across-sim variation for the duration and dissolution metrics. It appears that we continue to combine the two into one metric for formation. When I filed this issue, I was thinking that the latter was an error and that I had missed changing it. But now that I've gone back and re-read #274, I realize that we hadn't intended to change that one. But maybe we should, for consistency.
Duration
> In both cases, we're looking for a dx of how much the mean duration varies. I'm assuming that we're using the mean age of active ties at each timestep (`meanage(t)`) as our estimate of mean duration.

Yes. Ties that exist at the start of the network are assumed to have duration 0 when `duration_imptd == FALSE`, and have an expected value of time added onto them, based on the geometric distribution, when `duration_imptd == TRUE`. This is functionality we talked about last year, and that I added then.
> So if we want to distinguish within- vs. between-sim variation:
>
> - within-sim, there is a distribution of `meanage(t)` across t, and we calculate `SD_sim(meanage(t))`. To get a single estimate of this within-sim SD across simulations, would you take the `mean(SD_sim)`?

We don't report within-sim measures at all here. We used to include them in the combined measure, but no more. The issue is that, in the non-imputed version, most of the variation comes from the ramp-up, which is artifactual. And in the imputed version, the variance is biased down, because we give everyone the expected value of the geometric distribution, instead of a random draw from the geometric distribution. We contemplated the latter but decided against it in our discussion (again in #274).
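To make the contrast concrete, a sketch assuming a per-step dissolution probability `p`, under which the imputed addition is geometric with mean `(1 - p) / p` (R's `rgeom()` parameterization):

```r
p <- 0.05          # assumed per-step dissolution probability
n_censored <- 200  # ties already active at the start of the simulation

# Deterministic imputation: every censored tie gets the geometric expectation,
# contributing zero extra variance.
add_expected <- rep((1 - p) / p, n_censored)

# Contemplated (and rejected) alternative: a random geometric draw per tie,
# which would preserve the imputation variance.
add_random <- rgeom(n_censored, prob = p)

c(var(add_expected), var(add_random))  # 0 vs. roughly (1 - p) / p^2
```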
> - between-sim, we take the `mean_sim = mean(meanage(t))` for each simulation, and calculate `SD(mean_sim)`. Is that right?
Yes.
Formation and Dissolution
> These both use proportions as the basic measure: the proportion of empty dyads in which a tie forms, and the proportion of extant ties that dissolve, respectively.
As is so often the case, `formation` here refers to the terms in the formation formula, not the actual process of formation. Because the data we use are cross-sectional structure and durations, with formation based on the former conditional on the latter, the so-called `formation` diagnostics are actually about the target stats in the cross-sectional network structure over time.
> Then the analogous measures to duration can be constructed within and between sims.
Yes. But there isn't the same problem here as there is with the durations, hence the reason we've kept the two different forms of variance in there (albeit in one combined metric).
Imputation
> I understand how imputation is used in estimation to address missing data. I don't understand how imputation would be used in simulation, so I'm not sure why it is relevant here.
See above, and #274. This is so that the duration plot doesn't have the ramp-up that confused people, and so that the means in the diagnostics table in `print.netdx` aren't all uninterpretable, creating the impression that every good model is actually bad. People could see this in the plots if they understood what to look for, but not in the `print.netdx` diagnostics. And we can do this kind of imputation easily because all of our dissolution models have simple expectations, based on the memoryless property that underlies them. BUT, I already know what you're thinking: there are so many ways that our durations don't actually behave the way we think they should, etc. All of that comes from various forms of vital dynamics, which is why these diagnostics are done on networks with static node sets. These are all about making sure that the model (which is estimated using a static node set) converged properly, and that any approximations it employs are within the bounds of acceptable error, all before the complexity of simulating the full model with vital dynamics, infection, etc. Diagnostics on those are a whole additional universe not covered here.
Right, as soon as you mentioned the imputation for censored start times I remembered. I'll think on this.
made a branch off `plot_netdx_updates` that replaces Sim SD with the standard error of the mean (based on `coda::effectiveSize`)

this is available even if # of sims = 1, and allows for interpretation of the difference between the sim mean and the target in terms of a z score
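Presumably something along these lines: a sketch assuming the SE is the pooled SD divided by the square root of the `coda` effective sample size, with simulated stand-in data rather than EpiModel output:

```r
library(coda)

# Autocorrelated stand-in for a network statistic tracked over time.
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.7), n = 1000)) + 100
target <- 100

ess <- effectiveSize(mcmc(x))   # effective sample size; < length(x) under autocorrelation
se  <- sd(x) / sqrt(ess)        # SE of the mean, adjusted for autocorrelation
z   <- (mean(x) - target) / se  # z score for Sim Mean - Target
```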
@chad-klumb : can you clarify some things here:
- Is this for all three diagnostics (formation, dissolution, duration)?
- Is it giving a single measure that includes both within- and across-sim variation? (I'm guessing yes, which is why it is available even when nsims = 1.)
- And most importantly, how does it deal with both the non-imputed and the imputed durations? I don't see how a single measure that includes within-sim variation would be able to both (1) prevent good models from looking bad in the non-imputed case and (2) not bias down the variation in the imputed case, given that we don't include variation in the imputation.
It is for all three.
The SE gives a single measure based on all the simulated data. (For now, missing data for dissolution/duration are simply dropped; it may be possible to improve on that.)
The duration table prepared in `main` (and in the derived branches I made starting with `main`) always includes imputation:
https://github.com/EpiModel/EpiModel/blob/afa0910f690fc055c801eebc1bdf8e8727d71f86/R/netdx.R#L685-L708
The branches I made also impute corrections stochastically for each edge, rather than just adding the expectation in, so the variation there should be appropriate (with the technical exception of dyad-dependent constraints, which as usual break separability, but usually not too much in our applications).
> The duration table prepared in `main` (and in the derived branches I made starting with `main`) always includes imputation:

Oh, of course; sorry.
> The branches I made also impute corrections stochastically for each edge, rather than just adding the expectation in, so the variation there should be appropriate (with the technical exception of dyad-dependent constraints, which as usual break separability, but usually not too much in our applications).
Cool.
> The SE gives a single measure based on all the simulated data. (For now, missing data for dissolution/duration are simply dropped; it may be possible to improve on that.)
Ok, all in all I agree that this is a nice improvement on what we have now. That said, there was also a proposal on the table (woven through above comments) to split out the within- and across-sim variation. What are your thoughts on this, @chad-klumb? Would it be easy to do now that you're already deep in there? Is there a technical reason we haven't thought of why it wouldn't be appropriate? @martinamorris you were the biggest proponent of this split (IIRC) although I also see value in it. @smjenness thoughts as well?
I favor reporting the SE over other measures of variation because the SE is informative as to the significance of the difference `Sim Mean - Target`. Other measures of variation could in principle be reported, but wouldn't correspond as closely to the rest of the table we print out.
I agree that the SE is useful for capturing the variation in the estimate of the mean -- though its statistical interpretation depends on whether one knows the sampling distribution of the mean, and I'm not sure whether the sampling distribution of the netstats is known under all models. Still, using it as a rule of thumb makes sense to me. It would just be important to verify the sampling distribution before making principled statistical claims of significance.
What the SD offers is a way to capture the variation in the original data. For example: CV = SD/mean is an intuitive metric for understanding the magnitude of the "typical variation" of the points around the mean, expressed as a % of the mean. For example, "typical variation is about 3% of the mean" puts the variation back into the scale of the original variable.
They're both useful. Not sure we have to choose between them, as it's just one additional numerical summary to add.
I've lost track now -- did you resolve the within vs. between simulation variation summary question?
Taking the estimate of the SE as the actual standard deviation of the sampling distribution of the mean, the p-value (whether one- or two-sided) is bounded above by one over the Z score squared, c/o Chebyshev's Inequality.
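Spelled out, for any sampling distribution with finite variance,

$$P\big(|\bar{X} - \mu| \ge z \cdot \mathrm{SE}\big) \le \frac{1}{z^2},$$

so, e.g., $|Z| = 2$ bounds the two-sided p-value at 0.25 (versus $\approx 0.046$ under normality); the bound is distribution-free but conservative.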
The SE is one measure based on all the simulated data and is sensitive to variation both within and between simulations.
@chad-klumb that's a useful reminder, thx. I thought the formula was p $\approx$ 1/(SE^2) rather than 1/(z^2). The approximate p-value that Chebyshev's inequality gives can be quite conservative (larger) relative to the p-value from a Normal distribution, so if you know the CLT holds for your test statistic, you'd probably want to use the Normal distribution for significance testing. But this does raise the question of whether we'd want to output this stat as a bound for the p-value.
Re: within vs. between -- what I'm suggesting is that we might want estimates of each quantity, as well as the single summary stat that incorporates both. They both have interpretations of interest: how much variation is there in this stat over a single simulation, and how much variation is there in the mean of this stat between simulations.
In the near future, I can add (and document) columns in the stat tables for the statistic SD over the entirety of the simulated data, and (when nsims > 1) the SD across sims of the within-sim mean.
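A sketch of the two proposed columns, with `stat` as a hypothetical nsteps x nsims matrix of one diagnostic statistic:

```r
# Hypothetical stand-in: nsteps x nsims matrix of one diagnostic statistic.
stat <- matrix(rpois(100 * 15, lambda = 100), nrow = 100, ncol = 15)

sd_statistic <- sd(as.vector(stat))  # SD over the entirety of the simulated data
sd_sim_means <- sd(colMeans(stat))   # SD across sims of the within-sim means (nsims > 1)
```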
example script:
```r
require(EpiModel)

nw <- network_initialize(n = 100)
nw <- set_vertex_attribute(nw, 'neighborhood', rep(1:10, 10))
nw <- set_vertex_attribute(nw, 'position', rep(1:4, 25))

formation <- ~edges + nodematch('neighborhood', diff = TRUE)
target.stats <- c(100, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
coef.diss <- dissolution_coefs(
  dissolution = ~offset(edges) + offset(nodematch('neighborhood', diff = TRUE)),
  duration = c(20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30)
)

est <- netest(nw, formation, target.stats, coef.diss, verbose = FALSE)
dx11 <- netdx(est, nsims = 15, nsteps = 100)
print(dx11)
```
with output
```
> print(dx11)
EpiModel Network Diagnostics
=======================
Diagnostic Method: Dynamic
Simulations: 15
Time Steps per Sim: 100

Formation Diagnostics
-----------------------
                          Target Sim Mean Pct Diff Sim SE Z Score SD(1-Sim Mean) SD(Statistic)
edges                        100  100.125    0.125  0.730   0.172          5.111         9.141
nodematch.neighborhood.1       4    3.807   -4.817  0.173  -1.111          0.723         1.616
nodematch.neighborhood.2       5    4.802   -3.960  0.220  -0.899          0.845         2.067
nodematch.neighborhood.3       6    6.527    8.789  0.301   1.751          1.021         2.465
nodematch.neighborhood.4       7    6.897   -1.467  0.198  -0.518          1.281         2.177
nodematch.neighborhood.5       8    8.395    4.942  0.278   1.421          1.244         2.503
nodematch.neighborhood.6       9    9.160    1.778  0.310   0.517          2.057         3.048
nodematch.neighborhood.7      10   10.233    2.333  0.323   0.723          1.660         2.783
nodematch.neighborhood.8      11   11.057    0.521  0.284   0.202          1.631         2.793
nodematch.neighborhood.9      12   12.281    2.344  0.271   1.037          1.347         2.590
nodematch.neighborhood.10     13   12.203   -6.133  0.327  -2.438          1.907         2.948

Duration Diagnostics
-----------------------
                          Target Sim Mean Pct Diff Sim SE Z Score SD(1-Sim Mean) SD(Statistic)
edges                         20   20.287    1.437  0.414   0.695          2.324         4.368
nodematch.neighborhood.1      21   21.753    3.587  1.082   0.696          5.720        11.150
nodematch.neighborhood.2      22   20.141   -8.450  0.890  -2.090          5.236         9.628
nodematch.neighborhood.3      23   22.026   -4.236  0.940  -1.036          3.732         9.338
nodematch.neighborhood.4      24   22.587   -5.888  0.901  -1.568          4.045         8.186
nodematch.neighborhood.5      25   25.535    2.141  0.959   0.558          6.040         9.075
nodematch.neighborhood.6      26   28.041    7.848  1.242   1.643          8.929        11.813
nodematch.neighborhood.7      27   28.670    6.186  0.898   1.861          8.658        10.863
nodematch.neighborhood.8      28   29.422    5.078  0.852   1.669          5.581         8.221
nodematch.neighborhood.9      29   28.597   -1.390  0.868  -0.464          3.966         7.142
nodematch.neighborhood.10     30   31.178    3.928  1.396   0.844          8.957        11.421

Dissolution Diagnostics
-----------------------
                          Target Sim Mean Pct Diff Sim SE Z Score SD(1-Sim Mean) SD(Statistic)
edges                      0.050    0.048   -4.196  0.001  -1.431          0.005         0.056
nodematch.neighborhood.1   0.048    0.045   -6.461  0.003  -1.016          0.011         0.114
nodematch.neighborhood.2   0.045    0.048    5.869  0.003   0.891          0.010         0.112
nodematch.neighborhood.3   0.043    0.042   -4.475  0.002  -0.981          0.007         0.081
nodematch.neighborhood.4   0.042    0.041   -1.239  0.002  -0.259          0.007         0.079
nodematch.neighborhood.5   0.040    0.039   -3.547  0.002  -0.784          0.007         0.070
nodematch.neighborhood.6   0.038    0.037   -4.086  0.002  -0.967          0.007         0.066
nodematch.neighborhood.7   0.037    0.035   -4.817  0.002  -1.103          0.006         0.061
nodematch.neighborhood.8   0.036    0.036    1.495  0.001   0.377          0.007         0.058
nodematch.neighborhood.9   0.034    0.033   -5.317  0.001  -1.478          0.004         0.053
nodematch.neighborhood.10  0.033    0.034    1.905  0.001   0.495          0.003         0.050
```
and documentation
> The columns are named Target, Sim Mean, Pct Diff, Sim SE, Z Score, SD(1-Sim Mean), and SD(Statistic). The Sim Mean column refers to the mean statistic value, across all time steps in all simulations in the dynamic case, and across all sampled networks in all simulations in the static case. The Sim SE column refers to the standard error of the mean, estimated using coda::effectiveSize. The Target column indicates the target value (if present) for the statistic, and the Pct Diff column gives (Sim Mean - Target)/Target when Target is present. The Z Score column gives (Sim Mean - Target)/(Sim SE). The SD(1-Sim Mean) column gives the empirical standard deviation across simulations of the mean statistic value within simulation, and SD(Statistic) gives the empirical standard deviation of the statistic value across all the simulated data.
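(As an arithmetic check against the edges row above: Pct Diff = (100.125 - 100) / 100 x 100 = 0.125%, and Z Score = (100.125 - 100) / 0.730 ≈ 0.17; the printed 0.172 presumably reflects unrounded intermediate values.)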
This is looking really good. Lots of info, should cover most user contexts now.
One q about naming for the SD: it's not clear to me what the "1-" in SD(1-Sim Mean) is referring to. Maybe clearer to use SD(Mean) and SD(Statistic)?
This is great, thanks. I do find the names a bit unintuitive too, though. `SD(1-Sim Mean)` is the standard deviation across runs, I assume? And `SD(Statistic)` is the variation within runs? If so, how does the latter handle the multiple runs: calculating the SD within each run and then averaging them? Or is it meant to be the total SD, including variation both within and across runs?

Either way, this matches my expectation that most variation is within run rather than across. Which is, I think, as it should be with sufficiently long runs. For durations in particular, the variation over time within run is expected to be quite large, at least for moderately small networks, so this is really useful in allowing the user to see that there is high consistency across runs independently of that.
Sorry I posted all of that and only then noticed the documentation. Once I read that it all makes sense.
> This is great, thanks. I do find the names a bit unintuitive too, though. `SD(1-Sim Mean)` is the standard deviation across runs, I assume? And `SD(Statistic)` is the variation within runs? If so, how does the latter handle the multiple runs: calculating the SD within each run and then averaging them? Or is it meant to be the total SD, including variation both within and across runs?
>
> Either way, this matches my expectation that most variation is within run rather than across. Which is, I think, as it should be with sufficiently long runs. For durations in particular, the variation over time within run is expected to be quite large, at least for moderately small networks, so this is really useful in allowing the user to see that there is high consistency across runs independently of that.
1-Sim Mean refers to the mean of a single simulation; SD(1-Sim Mean) refers to the standard deviation of the 1-Sim Mean across all simulations performed.

SD(Statistic) is the standard deviation calculated when pooling all simulated data for the statistic in question (every time step is weighted equally in the dynamic case; every sampled network is weighted equally in the static case).
I have nothing against changing the names, although I find SD(Mean) slightly confusing as Sim SE is also the SD of a mean; the point of saying 1-Sim Mean was to emphasize that it is different from the overall Sim Mean. But I'm sure the names could be improved generally speaking (might require changing more than one of them, for consistency).
lol, ok, let me think on it a bit
- change SD(1-Sim Mean) to SD(Sim Means)
- leave SD(Statistic) as-is for now
- check documentation is clear
@martinamorris and @sgoodreau : do you have any final comments on these updates before we merge in the associated PR (#705)?
Yes -- it would involve a change, and I'd like to discuss with you and @sgoodreau first.
@martinamorris : sounds good. I'll merge in @chad-klumb's changes from #705 when complete, and you can propose further edits in a new branch/PR.
@martinamorris I believe you had further edits with the PR associated with #705. If so, could you submit those in the next week? If they don't make it in by then, I'll merge this PR and you can feel free to open up a new issue/PR with further edits. Thanks!
Martina and I are meeting on Monday to discuss her ideas around this.
@chad-klumb I'd like to clarify one thing here, and reopen this issue if my understanding is correct. What I'd like to clarify is the relationship between the Sim SE and the SD(Sim Means).
From the netdx output and documentation:
```
> print(dx11)
EpiModel Network Diagnostics
=======================
Diagnostic Method: Dynamic
Simulations: 15
Time Steps per Sim: 100

Formation Diagnostics
-----------------------
                          Target Sim Mean Pct Diff Sim SE Z Score SD(Sim Means) SD(Statistic)
edges                        100  100.125    0.125  0.730   0.172         5.111         9.141
nodematch.neighborhood.1       4    3.807   -4.817  0.173  -1.111         0.723         1.616
nodematch.neighborhood.2       5    4.802   -3.960  0.220  -0.899         0.845         2.067
nodematch.neighborhood.3       6    6.527    8.789  0.301   1.751         1.021         2.465
nodematch.neighborhood.4       7    6.897   -1.467  0.198  -0.518         1.281         2.177
nodematch.neighborhood.5       8    8.395    4.942  0.278   1.421         1.244         2.503
nodematch.neighborhood.6       9    9.160    1.778  0.310   0.517         2.057         3.048
nodematch.neighborhood.7      10   10.233    2.333  0.323   0.723         1.660         2.783
nodematch.neighborhood.8      11   11.057    0.521  0.284   0.202         1.631         2.793
nodematch.neighborhood.9      12   12.281    2.344  0.271   1.037         1.347         2.590
nodematch.neighborhood.10     13   12.203   -6.133  0.327  -2.438         1.907         2.948
```
and documentation
> The columns are named Target, Sim Mean, Pct Diff, Sim SE, Z Score, SD(Sim Means), and SD(Statistic). The Sim Mean column refers to the mean statistic value, across all time steps in all simulations in the dynamic case, and across all sampled networks in all simulations in the static case. The Sim SE column refers to the standard error of the mean, estimated using coda::effectiveSize. The Target column indicates the target value (if present) for the statistic, and the Pct Diff column gives (Sim Mean - Target)/Target when Target is present. The Z Score column gives (Sim Mean - Target)/(Sim SE). The SD(Sim Means) column gives the empirical standard deviation across simulations of the mean statistic value within simulation, and SD(Statistic) gives the empirical standard deviation of the statistic value across all the simulated data.
IIUC, we are reporting two different estimates of the SEM here:

- the `Sim SE`, which is based on the pooled data and the ESS. This can be used even with a single sim.
- the `SD(Sim Means)`, which is based on the between-sim variation in the Sim Mean. This can only be used when there are multiple sims.

Is that correct? And if so, wouldn't we expect these to be roughly equivalent?
Sim SE is roughly equivalent to SD(Sim Means)/sqrt(# sims), regarding different sims as independent.
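(This follows from independence: if each sim mean has variance $\sigma_m^2$, the mean of $n_\text{sims}$ independent sim means has variance $\sigma_m^2 / n_\text{sims}$, so the SE of the overall mean is approximately $\mathrm{SD}(\text{Sim Means}) / \sqrt{n_\text{sims}}$. The ESS-based Sim SE and this replication-based estimate can differ somewhat in finite samples.)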
@martinamorris : as discussed, please file a new issue regarding future revisions of these statistics. Thanks!