rigraph icon indicating copy to clipboard operation
rigraph copied to clipboard

Slower speed vs basic edgelist in some tasks

Open jessexknight opened this issue 3 years ago • 15 comments

Thanks for all your work building & maintaining this package.

I was exploring custom network generating functions directly in R, and started benchmarking speeds. To my surprise, working directly with edge lists was faster than several igraph functions.

For example, in one use-case (percolation model), I didn't need adjacent vertices as a list (by vertex) but just the total "unlisted" vector. So I wrote adjacent.i function below, which was 10-100 times faster for this task.

I don't raise this as criticism, as I understand there are many features like attributes which add overhead. But in some applications, users may want to explore whether simple edge lists in R may be faster.

# config
library('igraph')
library('microbenchmark')
options(width=200)
# functions
gen.edgelist = function(k.i){ # random graph from degree distribution
  i.e = sample(rep(seq(length(k.i)),times=k.i)) # random vertices, rep by degree
  ii.e = do.call(cbind,split(i.e,1:2)) # egde list (random pairs)
}
adjacent.i = function(ii.e,i){ # get adjacent vertices from edgelist
  i.adj = c(ii.e[ii.e[,1] %in% i,2], ii.e[ii.e[,2] %in% i,1])
}
# graph generation
N    = 999 # number of vertices
k.i  = rgamma(N,1,0.1) # degree distribution
ii.e = gen.edgelist(k.i) # edge list
G    = graph_from_edgelist(ii.e,directed=FALSE) # graph object
# microbenchmark: get adjacent vertices
s = sample(seq(N),round(0.5*N)) # random sample
microbenchmark(times=100,
  {k1 = unlist(adjacent_vertices(G,s))},
  {k2 = adjacent.i(ii.e,s)})
print(all(sort(k1)==sort(k2))) # check same result (maybe different order)
# plot
# plot(G,vertex.label=NA,vertex.size=50/sqrt(N)) # plot, if you want

jessexknight avatar Jul 29 '22 16:07 jessexknight

Interesting. And still faster than V(G)[.nei(s)]... Some comments:

  • Your code doesn't work with set.seed(1) -- the network-generating mechanism is faulty.
  • You might want to try sample_degseq() for similar logic of network generation (beware non-graphical degree sequences though), and perhaps also simplify() the result to drop multi-edges.
  • Are you by any occasion implementing percolation-like mechanisms into a package? I'd be interested in that.

mbojan avatar Jul 29 '22 17:07 mbojan

  • Are you by any occasion implementing percolation-like mechanisms into a package?

The Marthematica interface of igraph already has a fast percolation function, and in time I might port it to the igraph core.

szhorvat avatar Jul 29 '22 17:07 szhorvat

Thanks @mbojan!

  1. Ah yes, for MWE I removed the check sum(k.i) %% 2 == 0 --- i.e. even total degree sum. A quick fix is i.fix = sample.int(len(k.i),1); d.i[i.fix] = d.i[i.fix] + sum(k.i) %% 2, which adds 1 to a random vertex if the total degree sum is odd.
  2. Similarly, my real graph generation algorithms are more complex, this just for the MWE.
  3. In general, I am interested in contributing, but I don't have much time for that sort of thing these days.

jessexknight avatar Jul 29 '22 17:07 jessexknight

@szhorvat interesting, and thanks. Although I haven't reviewed the link in detail, in general we (a research group) often don't use such built-in functions, since we need to add all sorts of custom adjustments within the solver. e.g. We are simulating epidemics, and we might add vaccination of X people per day, during days t0--t1, or reducing the probability of transmission per contact due to some other intervention, etc.

jessexknight avatar Jul 29 '22 18:07 jessexknight

This is related to #194.

The bad performance is not in the C core but in creating vertex sequences, which is slow. Setting igraph_options(return.vs.es=F) will make it fast.

szhorvat avatar Jul 30 '22 09:07 szhorvat

@jessexknight, note that:

i.adj = c(ii.e[ii.e[,1] %in% i,2], ii.e[ii.e[,2] %in% i,1])

is limited to undirected graphs. For directed graphs, the adjacent vertices must be selected by:

c(t(ii.e[which(ii.e[,1]==i),]))

clpippel avatar Jul 30 '22 09:07 clpippel

BTW instead of gen.edgelist you should use sample_degseq with the simple method, which is equivalent. The simple.no.multiple.uniform method is the same but rejects non-simple graphs.

Yes, these are bad names which have already been changed in the C core. https://igraph.org/c/html/develop/igraph-Generators.html#igraph_degree_sequence_game

szhorvat avatar Jul 30 '22 09:07 szhorvat

The bad performance is not in the C core but in creating vertex sequences, which is slow. Setting igraph_options(return.vs.es=F) will make it fast.

Observe the following difference in performance:

# Berger/circle schedule for round robin tournaments
# add edges after each round
igraph_options(return.vs.es=F)
system.time({
  n <- 2000L; d <- n %/% 2L
  n <- max(n - !(n %%2L), 1L)
  r <- c(rbind((seq(from = 0, to = -(n+1)/2+1) %% n) , seq(from=0, to=(n-1)/2)))  # diagonal in Zn
  g <- make_empty_graph(n)
  
  for (i in seq_len(.Machine$integer.max)) {
    g <- add_edges(g, r+1L)
    r <- ((r+(n-1L) %/% 2L) %% n)
	if (i >= d) break
   }
})
ecount(g)
# ca 30 sec
# append rounds to vector and then create graph
system.time({  
  n <- 2000L; d <- n %/% 2L
  n <- max(n - !(n %%2L), 1L)
  r <- c(rbind((seq(from = 0, to = -(n+1)/2+1) %% n) , seq(from=0, to=(n-1)/2)))  # diagonal in Zn
  es <- c()                                                                       # edges: sequence of vertex ids
   
  for (i in seq_len(.Machine$integer.max)) {
     es <- append(es, r)
     r <- ((r+(n-1L) %/% 2L) %% n)
     if (i >= d) break
   }
  g <- graph(es+1L)
})
ecount(g)
[1] 1e+06
ca 3 seconds

clpippel avatar Jul 30 '22 14:07 clpippel

I repeated the microbenchmark proposed by @jessexknight with and without igraph_options(return.vs.es=F) to get a baseline comparison; here are the results:

> microbenchmark(times=100,
+   {k1 = unlist(adjacent_vertices(G,s))},
+   {k2 = adjacent.i(ii.e,s)})
Unit: microseconds
                                         expr      min        lq      mean   median       uq      max neval
 {     k1 = unlist(adjacent_vertices(G, s)) } 4202.787 4260.9660 4903.0174 4280.748 5175.922 7606.361   100
             {     k2 = adjacent.i(ii.e, s) }  164.861  168.6945  178.3557  172.856  180.072  277.693   100
> igraph_options(return.vs.es=F)
> microbenchmark(times=100,
+   {k1 = unlist(adjacent_vertices(G,s))},
+   {k2 = adjacent.i(ii.e,s)})
Unit: microseconds
                                         expr     min      lq     mean   median       uq     max neval
 {     k1 = unlist(adjacent_vertices(G, s)) } 165.189 181.261 192.8747 197.0665 199.3625 329.312   100
             {     k2 = adjacent.i(ii.e, s) } 184.664 203.565 215.4288 219.2680 225.0285 284.089   100

So indeed this is a manifestation of #194, although I am yet to understand why return.vs.es=T means such a significant overhead.

ntamas avatar Aug 03 '22 21:08 ntamas

I've spent a bit more time debugging the vs/es creation process and commenting bits and pieces in and out to see how they would affect performance. It turns out that a disproportionate amount of time is spent in the vs/es creation process when we set up a weak reference that points back to the graph that the vs/es was created from (to prevent the user from using a vs/es object with another graph). For instance, simply removing the part of the code that adds the weak reference in unsafe_create_vs() brings the median execution time of unlist(adjacent_vertices(G, s)) down from 4280 us to 844 us (but then two of the tests fail and I need to check why). Even removing the condition is_igraph(graph) && !warn_version(graph) (which is used for backward compatibility checks) from the internal get_vs_ref function trims the median execution time from 4280 us to 2798 us.

So, all in all, creating vs and es objects is expensive due to the additional bookkeeping we need to do in order to associate a vs/es object to the graph it was created from, and it is unclear to me yet whether we could do away with some of these to gain some performance here.

Replacing "igraph" %in% class(graph) with inherits(graph, "igraph") in the is_igraph() function also yields some minor speedup (from 4280 usec to 3800 usec).

ntamas avatar Aug 03 '22 23:08 ntamas

Thanks for following-up to diagnose @ntamas -- To be honest I've all-but abandoned R-igraph for the time being, with the exception of needing layout_with_fr, as most of the interesting stuff I'm doing is building up the graph, for which I need a bunch of custom functions anyways. If helpful for any future readers, you can see the tiny module here.

EDIT (2022-12-13): for those interested, I ended up implementing layout_with_fr in base R here with an option to gganimate the evolving layout as shown here.

jessexknight avatar Aug 22 '22 12:08 jessexknight

Scheduling this for a future hackathon as the underlying issue with weakref management needs some additional R expertise beyond my skills.

ntamas avatar Nov 10 '22 22:11 ntamas

Self-contained reprex:

library(igraph)

set.seed(20221213)

gen.edgelist <- function(k.i) { # random graph from degree distribution
  i.e <- sample(rep(seq(length(k.i)), times = k.i)) # random vertices, rep by degree
  ii.e <- do.call(cbind, split(i.e, 1:2)) # egde list (random pairs)
}
adjacent.i <- function(ii.e, i) { # get adjacent vertices from edgelist
  i.adj <- c(ii.e[ii.e[, 1] %in% i, 2], ii.e[ii.e[, 2] %in% i, 1])
}
# graph generation
N <- 999 # number of vertices
k.i <- rgamma(N, 1, 0.1) # degree distribution
ii.e <- gen.edgelist(k.i) # edge list
#> Warning in split.default(i.e, 1:2): data length is not a multiple of split
#> variable
#> Warning in (function (..., deparse.level = 1) : number of rows of result is not
#> a multiple of vector length (arg 2)
G <- graph_from_edgelist(ii.e, directed = FALSE)
s <- sample(seq(N), round(0.5 * N)) # random sample


bench::mark(
  iterations = 100,
  check = FALSE,
  {
    k1 <- unlist(adjacent_vertices(G, s))
  },
  {
    k2 <- adjacent.i(ii.e, s)
  }
)
#> # A tibble: 2 × 6
#>   expression                                     min   median `itr/sec` mem_al…¹
#>   <bch:expr>                                <bch:tm> <bch:tm>     <dbl> <bch:by>
#> 1 { k1 <- unlist(adjacent_vertices(G, s)) }   4.61ms    5.3ms      177.    358KB
#> 2 { k2 <- adjacent.i(ii.e, s) }              164.7µs  176.9µs     4721.    296KB
#> # … with 1 more variable: `gc/sec` <dbl>, and abbreviated variable name
#> #   ¹​mem_alloc
igraph_options(return.vs.es = F)
bench::mark(
  iterations = 100,
  check = FALSE,
  {
    k1 <- unlist(adjacent_vertices(G, s))
  },
  {
    k2 <- adjacent.i(ii.e, s)
  }
)
#> # A tibble: 2 × 6
#>   expression                                  min median itr/s…¹ mem_a…² gc/se…³
#>   <bch:expr>                                <bch> <bch:>   <dbl> <bch:b>   <dbl>
#> 1 { k1 <- unlist(adjacent_vertices(G, s)) } 151µs  168µs   5429.  37.2KB       0
#> 2 { k2 <- adjacent.i(ii.e, s) }             147µs  166µs   5566. 296.5KB       0
#> # … with abbreviated variable names ¹​`itr/sec`, ²​mem_alloc, ³​`gc/sec`

Created on 2022-12-13 with reprex v2.0.2

krlmlr avatar Dec 13 '22 09:12 krlmlr