ergm
ergm copied to clipboard
In ergm(), rename TARGET_STATS to something more aesthetic and less likely to be used as a variable name?
Right now, if ergm()
is given target.stats=
, it runs san()
to obtain the LHS network, calls it TARGET_STATS
, then replaces the LHS of the model formula with TARGET_STATS
(the name), augmenting its environment to contain said variable.
There are two reasons to change this:
- If
TARGET_STATS
is a variable used outside ofergm()
, that could create a subtle bug, since the formula would no longer be able to see it. - R variable names can technically be any strings, if quoted. For example,
We can therefore use an expression that is more aesthetically pleasing (say, for> `Hello World!` <- 2+2 > ls() [1] "Hello World!" > `Hello World!` [1] 4
summary()
output) or is more clearly meta-syntactic.
And so, I have two questions:
- Is anyone, particularly @statnet/epimodel developers, currently relying on
TARGET_STATS
? - What names would be good? Some ideas so far:
-
<target stats> ~ edges
-
[target stats] ~ edges
-
@krivit : we are not using TARGET_STATS
nor the two proposed alternatives. I agree the alternatives may be safer.
@mbojan Do you mean in the original ergm call? A network is needed on the LHS in order to provide all of the info about the meta-information about the network and, most importantly, all of the attribute values. SInce the latter are of arbitrary number and type it would be quite messy (and redudnant) to recreate all of that info in another format.
Right. Disregard my comment. Got carried away by a quest to overload the current interface...
Hmm. Calling a network "target stats" (in any format) seems a bit odd to me. Needless to say I hadn't realized we were already doing that. Would it make sense to use something like "target net" instead?
Hmm. Calling a network "target stats" (in any format) seems a bit odd to me. Needless to say I hadn't realized we were already doing that. Would it make sense to use something like "target net" instead?
It's because we are fitting to statistic, not to the network. At least, that's the reasoning.
Is this really a problem? It will happen only if the symbol TARGET_STATS
is used by the user in the ergm formula with the intention of referring to an object with that name in, say, global environment. That sounds rather unlikely to me. If we think it is likely enough we can throw an error/warning if symbol TARGET_STATS
has been used in the formula by the user. For reference, the source code of interest is here:
https://github.com/statnet/ergm/blob/554b40d09fa64e217af0a0352266b7df25761900/R/ergm.R#L581-L603
A separate issue is about the output: what should summary(ergmfit)
show for a model fit to vector of sufficient statistics rather than a network object. At this moment I believe you can't tell one from the other.
I think that in practice, the benefits will be largely aesthetic. In particular, summary.ergm()
would show [target stats] ~ edges
rather than TARGET_STATS ~ edges
and similar.
In the output IMHO [target stats]
or <target stats>
would look most "meta-" in the sense of not suggesting it is an R object to be interacted with, thus preferable.
I think one would want to avoid including whitespace in the name, as this can complicate parsing. Is the intent for this to be user-exposed? If so, we should let them use their own name and detect the object type after parsing. Otherwise, the current scheme seems OK. TARGET_STATS_LHS is uglier, but would be pretty collision-proof, if that is the concern. Whole thing seems a bit in the not broke/don't fix category.
Hmm. Calling a network "target stats" (in any format) seems a bit odd to me. Needless to say I hadn't realized we were already doing that. Would it make sense to use something like "target net" instead?
It's because we are fitting to statistic, not to the network. At least, that's the reasoning.
I get that the input is target stats. But really, what we're fitting to is a network with those target stats. If we could fit to the target stats directly we wouldn't need the SAN step.
My suggestion preserves the "target" language and clarifies that a network is still on the LHS.
Why not consistently allow functions that pull a network from the LHS to accept a basis
argument (overriding the LHS if present), as already done for simulate()
and summary()
? That would eliminate the need to modify the formula or its environment (and hence to worry about the name we give the san
network). If done for ergm()
, it would also allow for the removal of this line in EpiModel
https://github.com/EpiModel/EpiModel/blob/7a3dc83197c5398044a2688c2182be2733a5d4d3/R/netest.R#L225
which currently uses the much-more-likely-to-be-a-problem name nw
.
Actually, it looks like ergm()
had this functionality added already in
https://github.com/statnet/ergm/commit/8e01faed16f8d96d3def8740929b0910aacf66d1.
As far as print.summary.ergm
with print.formula = TRUE
goes, it seems we have some inconsistency in how target.stats
and basis
are handled. With ergm@master
, the following script
require(ergm)
nw1 <- network(10, directed = FALSE)
nw2 <- network(10, directed = FALSE)
e1 <- ergm(nw1 ~ edges)
s1 <- summary(e1)
print(s1, print.formula = TRUE)
e2 <- ergm(nw1 ~ edges, target.stats = 10)
s2 <- summary(e2)
print(s2, print.formula = TRUE)
e3 <- ergm(nw1 ~ edges, basis = nw2)
s3 <- summary(e3)
print(s3, print.formula = TRUE)
produces (in part) the respective formula printouts
Formula:
nw1 ~ edges
Formula:
TARGET_STATS ~ edges
Formula:
nw1 ~ edges
So, even though nw2
is used as the basis for e3
, the formula printed for s3
is nw1 ~ edges
.
If the desire is to have print.summary.ergm
with print.formula = TRUE
print some representation of the "effective" ergm()
formula (rather than the thing actually passed to ergm()
as the formula
argument), then we should probably change the basis
printing behavior.
For consistency and simplicity, we could do the following:
- if neither
target.stats
norbasis
is used, simply print the originalformula
argument toergm()
- if either
target.stats
orbasis
is used, print the corresponding argument name (i.e.,target.stats
orbasis
, respectively), bracketed in some way if we decide we want that, followed by~
and the RHS of theformula
argument toergm()
.
Note that this does not require modifying formula
or its environment.
(Another option, which also does not require modifying formula
or its environment, would be to have print.formula = TRUE
always print the original formula
argument to ergm()
, understanding that basis
and/or target.stats
can be printed as part of the call
.)
As I see it, modifying the formula as in
https://github.com/statnet/ergm/blob/61ebeea9e83cf906a39f60c253ede05f77da03b6/R/ergm.R#L603
and then subsequently calling ergm_model
on it (as in, e.g., a later logLik()
or simulate()
call) is fundamentally a bug (whether TARGET_STATS
or some non-syntactic alternative is used as the network name) that should be avoided if possible, and it does seem that it should be avoidable here.
Thanks! In terms of programming, having an "illegal" name is not a problem, since R variable names can be anything.
For example, here I changed the implementation to use target stats
(with a space), which passes the check and prudces the following (truncated) output:
library(ergm)
nw1 <- network(10, directed = FALSE)
nw2 <- network(10, directed = FALSE)
e1 <- ergm(nw1 ~ edges)
s1 <- summary(e1)
print(s1, print.formula = TRUE)
#> nw1 ~ edges
e2 <- ergm(nw1 ~ edges, target.stats = 10)
s2 <- summary(e2)
print(s2, print.formula = TRUE)
#> `target stats` ~ edges
e3 <- ergm(nw1 ~ edges, basis = nw2)
s3 <- summary(e3)
#> nw1 ~ edges
All that having been said, I suppose that aesthetics are not that important, since by default we print the call, not the formula. (I think we used to print the formula.)