ergm
ergm copied to clipboard
Collinearity diagnostics for ergms
Putting this here for now:
Date: Fri, 19 Apr 2019 17:42:16 +0000 From: "Duxbury, Scott W." [email protected] To: "[email protected]" [email protected] Subject: Collinearity diagnostic for ERGM Hi Martina,
It was great to meet you today. Thanks again for the talk--it was very interesting. Attached is my paper developing a collinearity diagnostic for ERGM, and linked at the end of this email is my GitHub repository with R code for implementing the measure.
Best,
Scott Duxbury PhD Candidate Department of Sociology The Ohio State University
https://github.com/sduxbury/vif-ergm/blob/master/Collinearity%20test%20for%20ERGM.R R function to detect multicollinearity in ERGM. Contribute to sduxbury/vif-ergm development by creating an account on GitHub.
Michal's remaining TODO items
- [x] Do not assume there is
edges
in the model - [x] Do not assume the
edges
is a first term in the formula - [x] Implement GVIFs
- [ ] Add examples
- [ ] Documentation
- [ ] Now
source=
(defaults to"model"
) is passed tovcov.ergm()
. Let it optionally accept a positive integer so that the variance-covariance matrix is based on a sample sufficient statistics from that many networks (as in fact suggested in Duxbury 2018) - [ ] Iron-out
print.vif_ergm()
.
@martinamorris , i'm on it.
Not quite VIF, but it's a start. I still need to implement tests for it.
I started working on this on mbojan/ergm@i88-vif but had to pause because the end of semester cought up with me.
I do have a question though @krivit @martinamorris . To calculate VIFs Scott's prototype code uses fit@vcov
but the paper advertises rather using the correlation matrix based on simulating networks from the model. These are not the same. What's actually the proper approach here?
@krivit will need to answer this one :)
Are they inverses of each other (exact or approximate), by any chance?
For now, I think we need to resolve several issues before incorporating the code, if we want to do that in the first place.
Package car
has a vif
generic and a vif.default()
method that works for any object that can return a model.matrix
. However, it makes an assumption that the intercept coefficient is named "(Intercept)
", which doesn't apply in our case. The current implementation of vif_ergm
makes an assumption that the intercept effect is the first effect in the model. This is not necessarily the case.
We can autodetect the intercept (or the set of statistics that are collectively the intercept) on the MPLE predictor matrix, by finding those columns for which a linear combination exists that produces a vector of 1s. (I am surprised that car::vif.default()
doesn't do that.) I am not sure how we would do that with the MCMC sample, without attaching an edge count/sum statistic to the returned change statistics (which may not be that bad an idea).
Given that, one can use vcov(ergm.fit, sources="model")
to get the variance-covariance matrix of the parameters.
Anyway, I don't think it's something we need to incorporate into 3.11.
Pulling car
as a dependency just for the generic seems like an overkill to me, it is a rather big package with dependencies unrelated to ergm
(e.g. lme4
, maptools
, quantreg
and so on). Moreover, IMHO the car:::vif.default()
method is not "default" enough for us to mold an ergm
object to take advantage of it: we do not have model.matrix()
method, there are no terms
in a sense compatible with stats::glm()
and so on. It would be nice to have a generic generic vif(object, ...) UseMethod("vif")
in stats
or generics
(the latter is not promising...). At this moment I am opting for a dedicated vif_ergm()
(or other name if anybody has suggestions).
Starting with Scotts code I ended-up rewriting the code from scratch in order to:
- do not assume that there is
edges
term in the formula - do not assume that the
edges
term is a first term in the formula - implement generalized VIFs that are calculated per term -- a single value for, say,
nodefactor()
term even if it has several parameters for different levels.
The usefulness and sensibility of the VIFs in the ERGM context is still to be proven, especially vis a vis endogeneous terms, diverse model specifications and so on. I have decided to give the user as much control as possible over how the VCov matrix is subsetted before actual VIF calculation. That involves, among other things, removing rows/cols corresponding to parameters of provided term(s). I think we need a canonical "term labeler" (hence #220) because it is needed here, and most likely in other places, to:
- Produce pretty term (not coef) labels for output
- The way for a user to refer to terms of a fitted model
I have implemented a naive labeler which basically returns the deparsed call from the model formula.
Comments, suggestions, alternative ideas are more than welcome, @krivit @martinamorris @drh20drh20 @CarterButts and whomever it may concern.
This is not yet merged and with several rough edges, but see it in action in the examples below:
data("faux.mesa.high")
# Largish model formula with more than one term of the same type
frm <- faux.mesa.high ~ nodefactor("Grade") + nodefactor("Race") +
nodefactor("Sex") + nodematch("Grade",diff=TRUE) +
nodematch("Race",diff=TRUE, levels=-c(1,4)) + nodematch("Sex",diff=FALSE) + edges
fit <- ergm(frm)
and now (print.vif_ergm()
still to be developed):
No terms are dropped
vif_ergm_new(fit)
## $vif_coef
## coef vif df vif_sqrt
## 1 nodefactor.Grade.8 62.119735 1 7.881607
## 2 nodefactor.Grade.9 44.932046 1 6.703137
## 3 nodefactor.Grade.10 22.078901 1 4.698819
## 4 nodefactor.Grade.11 32.952449 1 5.740422
## 5 nodefactor.Grade.12 15.890923 1 3.986342
## 6 nodefactor.Race.Hisp 22.637335 1 4.757871
## 7 nodefactor.Race.NatAm 17.970439 1 4.239155
## 8 nodefactor.Race.Other 1.066925 1 1.032921
## 9 nodefactor.Race.White 4.130202 1 2.032290
## 10 nodefactor.Sex.M 2.390419 1 1.546098
## 11 nodematch.Grade.7 97.139873 1 9.855956
## 12 nodematch.Grade.8 15.733629 1 3.966564
## 13 nodematch.Grade.9 6.726168 1 2.593486
## 14 nodematch.Grade.10 3.335544 1 1.826347
## 15 nodematch.Grade.11 5.797696 1 2.407840
## 16 nodematch.Grade.12 2.578454 1 1.605756
## 17 nodematch.Race.Hisp 6.151197 1 2.480161
## 18 nodematch.Race.NatAm 5.384731 1 2.320502
## 19 nodematch.Race.White 1.487202 1 1.219509
## 20 nodematch.Sex 2.864719 1 1.692548
## 21 edges 303.297061 1 17.415426
##
## $vif_term
## term vif df vif_sqrt
## 1 nodefactor("Grade") 2.465512e+05 5 3.460913
## 2 nodefactor("Race") 6.849846e+02 4 2.261831
## 3 nodefactor("Sex") 2.390419e+00 1 1.546098
## 4 nodematch("Grade", diff = TRUE) 4.345004e+05 6 2.950071
## 5 nodematch("Race", diff = TRUE, levels = -c(1, 4)) 3.817385e+01 3 1.834964
## 6 nodematch("Sex", diff = FALSE) 2.864719e+00 1 1.692548
## 7 edges 3.032971e+02 1 17.415426
##
## attr(,"class")
## [1] "ergm_vif"
Drop edges
term
(probably what one would do for this model)
vif_ergm_new(fit, drop_terms = "edges")
## $vif_coef
## coef vif df vif_sqrt
## 1 nodefactor.Grade.8 49.952154 1 7.067684
## 2 nodefactor.Grade.9 36.315347 1 6.026222
## 3 nodefactor.Grade.10 19.389272 1 4.403325
## 4 nodefactor.Grade.11 28.278523 1 5.317755
## 5 nodefactor.Grade.12 14.339479 1 3.786750
## 6 nodefactor.Race.Hisp 9.978772 1 3.158920
## 7 nodefactor.Race.NatAm 9.322576 1 3.053289
## 8 nodefactor.Race.Other 1.061432 1 1.030258
## 9 nodefactor.Race.White 3.351882 1 1.830815
## 10 nodefactor.Sex.M 1.100157 1 1.048884
## 11 nodematch.Grade.7 61.414244 1 7.836724
## 12 nodematch.Grade.8 13.215291 1 3.635284
## 13 nodematch.Grade.9 5.959511 1 2.441211
## 14 nodematch.Grade.10 3.186973 1 1.785210
## 15 nodematch.Grade.11 5.332741 1 2.309273
## 16 nodematch.Grade.12 2.507388 1 1.583473
## 17 nodematch.Race.Hisp 4.518622 1 2.125705
## 18 nodematch.Race.NatAm 4.203501 1 2.050244
## 19 nodematch.Race.White 1.459541 1 1.208115
## 20 nodematch.Sex 1.021836 1 1.010859
##
## $vif_term
## term vif df vif_sqrt
## 1 nodefactor("Grade") 91630.055413 5 3.134756
## 2 nodefactor("Race") 21.798404 4 1.469952
## 3 nodefactor("Sex") 1.100157 1 1.048884
## 4 nodematch("Grade", diff = TRUE) 89456.359111 6 2.586034
## 5 nodematch("Race", diff = TRUE, levels = -c(1, 4)) 18.958127 3 1.632924
## 6 nodematch("Sex", diff = FALSE) 1.021836 1 1.010859
##
## attr(,"class")
## [1] "ergm_vif"
Drop terms edges
and nodematch
for Grade
vif_ergm_new(fit, drop_terms = c("edges", 'nodematch("Grade", diff = TRUE)'))
## $vif_coef
## coef vif df vif_sqrt
## 1 nodefactor.Grade.8 3.264765 1 1.806866
## 2 nodefactor.Grade.9 4.165662 1 2.040995
## 3 nodefactor.Grade.10 4.155404 1 2.038481
## 4 nodefactor.Grade.11 3.995176 1 1.998794
## 5 nodefactor.Grade.12 4.079305 1 2.019729
## 6 nodefactor.Race.Hisp 9.642242 1 3.105196
## 7 nodefactor.Race.NatAm 9.152473 1 3.025305
## 8 nodefactor.Race.Other 1.047986 1 1.023712
## 9 nodefactor.Race.White 3.173251 1 1.781362
## 10 nodefactor.Sex.M 1.040865 1 1.020228
## 11 nodematch.Race.Hisp 4.402948 1 2.098320
## 12 nodematch.Race.NatAm 4.149227 1 2.036965
## 13 nodematch.Race.White 1.425276 1 1.193849
## 14 nodematch.Sex 1.019944 1 1.009923
##
## $vif_term
## term vif df vif_sqrt
## 1 nodefactor("Grade") 1.033272 5 1.003278
## 2 nodefactor("Race") 18.288827 4 1.438047
## 3 nodefactor("Sex") 1.040865 1 1.020228
## 4 nodematch("Race", diff = TRUE, levels = -c(1, 4)) 17.887558 3 1.617181
## 5 nodematch("Sex", diff = FALSE) 1.019944 1 1.009923
##
## attr(,"class")
## [1] "ergm_vif"
We can autodetect the intercept (or the set of statistics that are collectively the intercept) on the MPLE predictor matrix, by finding those columns for which a linear combination exists that produces a vector of 1s. (I am surprised that car::vif.default() doesn't do that.) I am not sure how we would do that with the MCMC sample, without attaching an edge count/sum statistic to the returned change statistics (which may not be that bad an idea).
I like that a lot! It will be easy to integrate as I'd only need the vector of coef/term names of these columns.
Technically, we can Enhances
car
---though the user would need to install it to get the VIF capability, for no clear reason.
No real comments from me. If people think it's useful, and it doesn't take up too much additional space or complicate the code excessively, it may as well go in. We really should do the whole modularisation (#186) sooner than later, though, and then we'll have a relatively lightweight package into which we can incorporate all the diagnostics utilities we can imagine.
Technically, we can
Enhances
car
---though the user would need to install it to get the VIF capability, for no clear reason.
Yeah... so not really a solution. Not to mention inflated R CMD check time because of all additional deps...
No real comments from me. If people think it's useful, and it doesn't take up too much additional space or complicate the code excessively, it may as well go in. We really should do the whole modularisation (#186) sooner than later, though, and then we'll have a relatively lightweight package into which we can incorporate all the diagnostics utilities we can imagine.
It is a rather isolated piece with no extra dependencies so I guess there will be no problem to move it around when the time comes.
An alternative to drop_terms
expecting a character vector is a formula, just like gof(GOF=)
argument.