Slower speed vs basic edgelist in some tasks
Thanks for all your work building & maintaining this package.
I was exploring custom network generating functions directly in R, and started benchmarking speeds. To my surprise, working directly with edge lists was faster than several igraph functions.
For example, in one use-case (percolation model), I didn't need adjacent vertices as a list (by vertex) but just the total "unlisted" vector. So I wrote adjacent.i function below, which was 10-100 times faster for this task.
I don't raise this as criticism, as I understand there are many features like attributes which add overhead. But in some applications, users may want to explore whether simple edge lists in R may be faster.
# config
library('igraph')
library('microbenchmark')
options(width=200)
# functions
gen.edgelist = function(k.i){ # random graph from degree distribution
i.e = sample(rep(seq(length(k.i)),times=k.i)) # random vertices, rep by degree
ii.e = do.call(cbind,split(i.e,1:2)) # egde list (random pairs)
}
adjacent.i = function(ii.e,i){ # get adjacent vertices from edgelist
i.adj = c(ii.e[ii.e[,1] %in% i,2], ii.e[ii.e[,2] %in% i,1])
}
# graph generation
N = 999 # number of vertices
k.i = rgamma(N,1,0.1) # degree distribution
ii.e = gen.edgelist(k.i) # edge list
G = graph_from_edgelist(ii.e,directed=FALSE) # graph object
# microbenchmark: get adjacent vertices
s = sample(seq(N),round(0.5*N)) # random sample
microbenchmark(times=100,
{k1 = unlist(adjacent_vertices(G,s))},
{k2 = adjacent.i(ii.e,s)})
print(all(sort(k1)==sort(k2))) # check same result (maybe different order)
# plot
# plot(G,vertex.label=NA,vertex.size=50/sqrt(N)) # plot, if you want
Interesting. And still faster than V(G)[.nei(s)]... Some comments:
- Your code doesn't work with
set.seed(1)-- the network-generating mechanism is faulty. - You might want to try
sample_degseq()for similar logic of network generation (beware non-graphical degree sequences though), and perhaps alsosimplify()the result to drop multi-edges. - Are you by any occasion implementing percolation-like mechanisms into a package? I'd be interested in that.
- Are you by any occasion implementing percolation-like mechanisms into a package?
The Marthematica interface of igraph already has a fast percolation function, and in time I might port it to the igraph core.
Thanks @mbojan!
- Ah yes, for MWE I removed the check
sum(k.i) %% 2 == 0--- i.e. even total degree sum. A quick fix isi.fix = sample.int(len(k.i),1); d.i[i.fix] = d.i[i.fix] + sum(k.i) %% 2, which adds 1 to a random vertex if the total degree sum is odd. - Similarly, my real graph generation algorithms are more complex, this just for the MWE.
- In general, I am interested in contributing, but I don't have much time for that sort of thing these days.
@szhorvat interesting, and thanks. Although I haven't reviewed the link in detail, in general we (a research group) often don't use such built-in functions, since we need to add all sorts of custom adjustments within the solver. e.g. We are simulating epidemics, and we might add vaccination of X people per day, during days t0--t1, or reducing the probability of transmission per contact due to some other intervention, etc.
This is related to #194.
The bad performance is not in the C core but in creating vertex sequences, which is slow. Setting igraph_options(return.vs.es=F) will make it fast.
@jessexknight, note that:
i.adj = c(ii.e[ii.e[,1] %in% i,2], ii.e[ii.e[,2] %in% i,1])
is limited to undirected graphs. For directed graphs, the adjacent vertices must be selected by:
c(t(ii.e[which(ii.e[,1]==i),]))
BTW instead of gen.edgelist you should use sample_degseq with the simple method, which is equivalent. The simple.no.multiple.uniform method is the same but rejects non-simple graphs.
Yes, these are bad names which have already been changed in the C core. https://igraph.org/c/html/develop/igraph-Generators.html#igraph_degree_sequence_game
The bad performance is not in the C core but in creating vertex sequences, which is slow. Setting
igraph_options(return.vs.es=F)will make it fast.
Observe the following difference in performance:
# Berger/circle schedule for round robin tournaments
# add edges after each round
igraph_options(return.vs.es=F)
system.time({
n <- 2000L; d <- n %/% 2L
n <- max(n - !(n %%2L), 1L)
r <- c(rbind((seq(from = 0, to = -(n+1)/2+1) %% n) , seq(from=0, to=(n-1)/2))) # diagonal in Zn
g <- make_empty_graph(n)
for (i in seq_len(.Machine$integer.max)) {
g <- add_edges(g, r+1L)
r <- ((r+(n-1L) %/% 2L) %% n)
if (i >= d) break
}
})
ecount(g)
# ca 30 sec
# append rounds to vector and then create graph
system.time({
n <- 2000L; d <- n %/% 2L
n <- max(n - !(n %%2L), 1L)
r <- c(rbind((seq(from = 0, to = -(n+1)/2+1) %% n) , seq(from=0, to=(n-1)/2))) # diagonal in Zn
es <- c() # edges: sequence of vertex ids
for (i in seq_len(.Machine$integer.max)) {
es <- append(es, r)
r <- ((r+(n-1L) %/% 2L) %% n)
if (i >= d) break
}
g <- graph(es+1L)
})
ecount(g)
[1] 1e+06
ca 3 seconds
I repeated the microbenchmark proposed by @jessexknight with and without igraph_options(return.vs.es=F) to get a baseline comparison; here are the results:
> microbenchmark(times=100,
+ {k1 = unlist(adjacent_vertices(G,s))},
+ {k2 = adjacent.i(ii.e,s)})
Unit: microseconds
expr min lq mean median uq max neval
{ k1 = unlist(adjacent_vertices(G, s)) } 4202.787 4260.9660 4903.0174 4280.748 5175.922 7606.361 100
{ k2 = adjacent.i(ii.e, s) } 164.861 168.6945 178.3557 172.856 180.072 277.693 100
> igraph_options(return.vs.es=F)
> microbenchmark(times=100,
+ {k1 = unlist(adjacent_vertices(G,s))},
+ {k2 = adjacent.i(ii.e,s)})
Unit: microseconds
expr min lq mean median uq max neval
{ k1 = unlist(adjacent_vertices(G, s)) } 165.189 181.261 192.8747 197.0665 199.3625 329.312 100
{ k2 = adjacent.i(ii.e, s) } 184.664 203.565 215.4288 219.2680 225.0285 284.089 100
So indeed this is a manifestation of #194, although I am yet to understand why return.vs.es=T means such a significant overhead.
I've spent a bit more time debugging the vs/es creation process and commenting bits and pieces in and out to see how they would affect performance. It turns out that a disproportionate amount of time is spent in the vs/es creation process when we set up a weak reference that points back to the graph that the vs/es was created from (to prevent the user from using a vs/es object with another graph). For instance, simply removing the part of the code that adds the weak reference in unsafe_create_vs() brings the median execution time of unlist(adjacent_vertices(G, s)) down from 4280 us to 844 us (but then two of the tests fail and I need to check why). Even removing the condition is_igraph(graph) && !warn_version(graph) (which is used for backward compatibility checks) from the internal get_vs_ref function trims the median execution time from 4280 us to 2798 us.
So, all in all, creating vs and es objects is expensive due to the additional bookkeeping we need to do in order to associate a vs/es object to the graph it was created from, and it is unclear to me yet whether we could do away with some of these to gain some performance here.
Replacing "igraph" %in% class(graph) with inherits(graph, "igraph") in the is_igraph() function also yields some minor speedup (from 4280 usec to 3800 usec).
Thanks for following-up to diagnose @ntamas -- To be honest I've all-but abandoned R-igraph for the time being, with the exception of needing layout_with_fr, as most of the interesting stuff I'm doing is building up the graph, for which I need a bunch of custom functions anyways. If helpful for any future readers, you can see the tiny module here.
EDIT (2022-12-13): for those interested, I ended up implementing layout_with_fr in base R here with an option to gganimate the evolving layout as shown here.
Scheduling this for a future hackathon as the underlying issue with weakref management needs some additional R expertise beyond my skills.
Self-contained reprex:
library(igraph)
set.seed(20221213)
gen.edgelist <- function(k.i) { # random graph from degree distribution
i.e <- sample(rep(seq(length(k.i)), times = k.i)) # random vertices, rep by degree
ii.e <- do.call(cbind, split(i.e, 1:2)) # egde list (random pairs)
}
adjacent.i <- function(ii.e, i) { # get adjacent vertices from edgelist
i.adj <- c(ii.e[ii.e[, 1] %in% i, 2], ii.e[ii.e[, 2] %in% i, 1])
}
# graph generation
N <- 999 # number of vertices
k.i <- rgamma(N, 1, 0.1) # degree distribution
ii.e <- gen.edgelist(k.i) # edge list
#> Warning in split.default(i.e, 1:2): data length is not a multiple of split
#> variable
#> Warning in (function (..., deparse.level = 1) : number of rows of result is not
#> a multiple of vector length (arg 2)
G <- graph_from_edgelist(ii.e, directed = FALSE)
s <- sample(seq(N), round(0.5 * N)) # random sample
bench::mark(
iterations = 100,
check = FALSE,
{
k1 <- unlist(adjacent_vertices(G, s))
},
{
k2 <- adjacent.i(ii.e, s)
}
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_al…¹
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:by>
#> 1 { k1 <- unlist(adjacent_vertices(G, s)) } 4.61ms 5.3ms 177. 358KB
#> 2 { k2 <- adjacent.i(ii.e, s) } 164.7µs 176.9µs 4721. 296KB
#> # … with 1 more variable: `gc/sec` <dbl>, and abbreviated variable name
#> # ¹mem_alloc
igraph_options(return.vs.es = F)
bench::mark(
iterations = 100,
check = FALSE,
{
k1 <- unlist(adjacent_vertices(G, s))
},
{
k2 <- adjacent.i(ii.e, s)
}
)
#> # A tibble: 2 × 6
#> expression min median itr/s…¹ mem_a…² gc/se…³
#> <bch:expr> <bch> <bch:> <dbl> <bch:b> <dbl>
#> 1 { k1 <- unlist(adjacent_vertices(G, s)) } 151µs 168µs 5429. 37.2KB 0
#> 2 { k2 <- adjacent.i(ii.e, s) } 147µs 166µs 5566. 296.5KB 0
#> # … with abbreviated variable names ¹`itr/sec`, ²mem_alloc, ³`gc/sec`
Created on 2022-12-13 with reprex v2.0.2