Consider calling glm.fit() instead of glm()
In
https://github.com/statnet/ergm/blob/1f4401ed73356cbf89c2f35fd35d6a981f4caea0/R/ergm.mple.R#L100-L101
consider calling glm.fit() directly rather than glm(). Experiments with biggish data suggest it can cut the computing time roughly in half.
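For illustration, here is a minimal self-contained sketch (with synthetic data, not the actual variables used in `ergm.mple()`) of the proposed substitution: `glm()` builds a model frame and design matrix from a formula on every call, while `glm.fit()` takes the design matrix directly and runs the same IRLS routine, so the fitted coefficients agree.

```r
## Synthetic logistic-regression data; names are illustrative only.
set.seed(1)
n    <- 1e4
xmat <- cbind(1, matrix(rnorm(n * 3), n, 3))  # design matrix incl. intercept
zy   <- rbinom(n, 1, 0.5)                     # 0/1 response
wts  <- rep(1, n)                             # case weights

## Formula interface (current code path): model-frame machinery + IRLS.
fit1 <- glm(zy ~ xmat - 1, weights = wts, family = binomial())

## Direct fitter (proposed): same IRLS, no formula/model-frame overhead.
fit2 <- glm.fit(x = xmat, y = zy, weights = wts, family = binomial())

all.equal(unname(coef(fit1)), unname(fit2$coefficients))  # should be TRUE
```

Note that `glm.fit()` returns a bare list rather than a `"glm"` object, so downstream code relying on methods like `summary.glm()` may need adjusting.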
@mbojan , can you test to see if this works?
@mbojan , I'll be submitting an update to ergm in the next few days. If you want this to go in, let me know ASAP.
Oh dear, I have a week of workshops, including ergms. Can we release next week? In principle the answer is yes, but I haven't tested it yet.
OK, can you get it done in the next day or two?
@mbojan ?
@AdrienLeGuillou , you often fit MPLE to large networks, right? Can you by any chance test this?
I just ran a quick test on a smaller 10k-node network using this branch. It worked fine, but I can't tell whether it was faster, as I usually use "Stochastic-Approximation" on these smaller local tests. I can try on the HPC with our 3 100k-node networks and compare the time it takes.
I just realized that ergm.mple is called regardless of the main.method we use.
So I can confirm that it works perfectly on our 10k-node networks.
Fitting takes a very similar amount of time with both versions, as the MPLE step is not the longest part anyway.
I confirm it also works on the 100k-node network.
It was actually slower with glm.fit, but the difference came from the number of MCMLE iterations.
Thanks @AdrienLeGuillou. @krivit, don't merge; leave as is. I need to dig out the script where I think I noticed the difference.
A few things to try:
- Run `set.seed(0)` (or some other number) before the `ergm()` call. I don't think the GLM code has any stochastic elements, so which variant is used shouldn't make a difference.
- Run `Rprof()` before running the test code and `Rprof(NULL)` after; then `summaryRprof()` should tell you how much time is being spent in `ergm.mple()`.
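The profiling recipe above, as a self-contained sketch: here a plain `glm.fit()` call on synthetic data stands in for the `ergm()` fit (the ergm package is not loaded), since the `Rprof()` mechanics are the same either way.

```r
set.seed(0)                              # fix the RNG before the fit
prof_file <- tempfile(fileext = ".out")

Rprof(prof_file)                         # start sampling profiler
x   <- matrix(rnorm(5e5 * 5), ncol = 5)  # stand-in workload
y   <- rbinom(5e5, 1, 0.5)
fit <- glm.fit(x, y, family = binomial())
Rprof(NULL)                              # stop profiling

## Time spent per function; in the real test, look for the "ergm.mple" row.
head(summaryRprof(prof_file)$by.total)
```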
I just ran a few tests on the HPC with a 100k-node network, using set.seed(0) and Rprof.
I don't see any difference.
```
            total.time total.pct self.time self.pct
"ergm.mple"        0.4      0.08         0        0
```
That was the last run with the @i120-glm-fit branch.
Both branches give total times between 0.4 and 0.6.
I think that for big networks this is not very important, as the overhead of glm over glm.fit is quickly dwarfed by the actual computation time.
For smaller networks it probably makes a lot more difference.
For reference, this is the formula used for the network:
```r
model_main <- ~ edges +
  nodematch("age.grp", diff = TRUE) +
  nodefactor("age.grp", levels = -1) +
  nodematch("race", diff = FALSE) +
  nodefactor("race", levels = -1) +
  nodefactor("deg.casl", levels = -1) +
  concurrent +
  degrange(from = 3) +
  nodematch("role.class", diff = TRUE, levels = c(1, 2))
```
@krivit @AdrienLeGuillou Thanks for investigating. I can't find the use case in which I think I noticed that effect. Did you look at the possible effect on memory footprint too? I'd say we can declare this issue "unconfirmed" and let it rest.
@mbojan glm would use a bit more memory, but nothing significant compared to the memory used by the data and the fitting routines. On neither my machine nor the HPC could I detect a difference with simple htop.
Closing