python-igraph
PageRank Error: Fatal error at src/core/vector.c:126 : Assertion failed: size >= 0
Describe the bug
Got an error when running the PageRank algorithm on a large-scale graph. The number of edges is 2658026604, the edge file is 51 GB, and the edge file format is as follows:
woker@robot:benchmark_datasets$ head datagen-scale_1X.txt
102309412 0
13241850 1
39897563 2
60103142 3
68395754 3
56599636 5
10025524 5
13963849 5
10428790 5
93483924 5
The machine has 32 CPU cores and 320 GB of memory.
Here is the error message:
Fatal error at src/core/vector.c:126 : Assertion failed: size >= 0
The test code:
import sys
import os
import time
import igraph
from igraph import *
if __name__ == '__main__':
data_file = "datagen-scale_1X.txt"
print("begin load graph...")
s_time = time.time()
g = igraph.read(data_file, format="edge", directed=True)
print(f"vertex count {g.vcount()}, edge count {g.ecount()}")
print(f"finished load graph, cost: {time.time() - s_time}")
print("====== start pagerank =======")
s_time = time.time()
res = g.pagerank(niter=10, eps=1e-6)
print(f"res count: {len(res)}")
print(res[0:10])
print(f"algorithm cost: {time.time() - s_time}")
Version information: Python 3.7.12, igraph 0.9.10
Loading the graph from the file works fine; the error arises from the PageRank algorithm.
Version 0.9 does not support such huge graphs. 0.10 will. Please try with the development version and let us know if it works: https://github.com/igraph/python-igraph/#compiling-the-development-version
Also, do the expected vertex and edge counts match? I'm asking because it looks like your input file might have gaps in the IDs. igraph requires the numeric vertex IDs to be consecutive integers, so if you have an edge list that contains a single edge like 1 1000000, you'll get 1000001 vertices, out of which 999999 will be isolates. If you have gaps in the IDs, load the file as "ncol", not edgelist, as "ncol" will renumber the vertices and assign the original IDs as a vertex attribute.
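To illustrate the renumbering that the "ncol" reader performs, here is a minimal pure-Python sketch. The `renumber` helper is hypothetical (not part of igraph); the real reader does this internally and stores the original IDs in the `name` vertex attribute.

```python
# Sketch of the renumbering that "ncol" performs: map arbitrary vertex IDs
# to consecutive indices, and keep the original IDs as an attribute.
def renumber(edges):
    ids = {}          # original ID -> consecutive index
    new_edges = []
    for u, v in edges:
        for x in (u, v):
            if x not in ids:
                ids[x] = len(ids)
        new_edges.append((ids[u], ids[v]))
    # invert the mapping so names[index] gives back the original ID
    names = [None] * len(ids)
    for original, idx in ids.items():
        names[idx] = original
    return new_edges, names

edges, names = renumber([(1, 1000000)])
print(edges)   # [(0, 1)]  -- 2 vertices instead of 1000001
print(names)   # [1, 1000000]
```

With the raw edgelist reader, the same input would produce 1000001 vertices, 999999 of them isolates.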
2658026604 is just over the 31-bit limit:
In[44]:= Log2[2658026604] // N
Out[44]= 31.3077
igraph_pagerank() can currently use one of two methods: PRPACK and ARPACK.
PRPACK requires an edge count less than 2^31. The expected behaviour in 0.10 is to throw an error. We might be able to patch PRPACK to work with larger graphs, but this will be quite some work and it probably won't happen in the short term.
ARPACK requires a vertex count less than 2^31. It is unclear if it also requires an edge count smaller than this—I expect not. It might work. Be sure to use the NCOL reader as Tamás said to ensure that you do not get unnecessary vertices. The tracking issue for supporting large arrays with BLAS/LAPACK/ARPACK is here and is again not a short-term project.
If you could test both methods in 0.10 (i.e. develop branch) and report back that would be much appreciated.
@szhorvat Thanks. I'm confused about why the edge count must be less than 2^31. If a 32-bit unsigned integer were used as the vector index in storage, the max edge count would be 2^32 (about 4.3 billion), so why the 2^31 (about 2.1 billion) limit? Is the vector index a 32-bit signed integer?
@ntamas Thanks, I got it. The max vertex id in my graph is 1682052736 (less than 2^31), but I'm not sure whether there are any isolated nodes. I'll try loading the file as "ncol" and run again.
Yes, it uses signed integers. Were you able to test with the development version and see if the ARPACK method works?
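As a quick illustration of why a signed 32-bit index caps counts at 2^31 rather than 2^32:

```python
import ctypes

# The largest value a signed 32-bit integer can hold is 2**31 - 1;
# one past that wraps around into negative values, so sizes/indices
# stored as signed 32-bit ints are limited to 2**31, not 2**32.
print(ctypes.c_int32(2**31 - 1).value)  # 2147483647
print(ctypes.c_int32(2**31).value)      # -2147483648
```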
Unfortunately, when I rebuilt the develop version and ran the PageRank algorithm, I got a "Too many edges for PRPACK" error.
The develop version uses PRPACK, but I don't know how to select the ARPACK method.
The detailed error message:
Traceback (most recent call last):
File "igraph/page_rank.py", line 29, in <module>
res = g.pagerank(niter=10, eps=1e-6)
File "/opt/conda/lib/python3.7/site-packages/igraph-0.9.11-py3.7-linux-x86_64.egg/igraph/structural.py", line 94, in _pagerank
eps,
igraph._igraph.InternalError: Error at src/centrality/prpack.cpp:46: Too many edges for PRPACK. -- Failed
Use implementation="arpack"
https://igraph.org/python/api/latest/igraph.Graph.html#pagerank
Using implementation="arpack" raises a segmentation fault, and there is no stack trace.
Can you please double-check that your graph has strictly fewer vertices than $2^{31}$? Can you please report the vertex and edge counts here? Is the graph weighted?
I looked at the code and I do not see any reason why this should fail, but I do not currently have access to a machine with enough memory to test it. I added a safety check for the vertex count. If you re-install the development version now, it should include that check. However, if your graph has fewer than $2^{31}$ vertices, it should make no difference.
Any tips here @ntamas ? I can't see anything wrong in the Python/C glue code either.
Ok, I will check it again and give you feedback later.
The maximum vertex id is 434943375, which is far less than $2^{31}$. The graph is unweighted. Reading the data from the file works, and the graph object is created successfully.
vertex count 434943376, edge count 2658026604
I rebuilt the python-igraph development version using igraph-4d2a03c30d303f6e6b9a64c7bf0c334b856c3c8c. Does this igraph version work?
The test code
import sys
import os
import time
sys.path.append(os.path.abspath(os.path.dirname(__file__)+'/'+'..'))
import igraph
from igraph import *
from common.time import Timer
from common.logger import logger
if __name__ == '__main__':
timer = Timer()
data_file = sys.argv[1]
logger.info("begin load graph...")
g = igraph.read(data_file, format="edge", directed=True)
logger.info(f"vertex count {g.vcount()}, edge count {g.ecount()}")
logger.info(f"finied load graph, cost: {timer.cost()}")
logger.info(f"build graph cost: {timer.cost()}")
logger.info("====== start pagerank =======")
alg_timer = Timer()
res = g.pagerank(implementation="arpack", niter=10, eps=1e-6)
logger.info(f"res count: {len(res)}")
logger.info(res[0:10])
logger.info(f"algorithm cost: {alg_timer.cost()}")
The stdout log
2022-08-11 04:50:44,272 [INFO] page_rank.py:18 - begin load graph...
2022-08-11 05:43:43,548 [INFO] page_rank.py:22 - vertex count 434943376, edge count 2658026604
2022-08-11 05:43:43,548 [INFO] page_rank.py:23 - finied load graph, cost: 3179.28
2022-08-11 05:43:43,548 [INFO] page_rank.py:24 - build graph cost: 3179.28
2022-08-11 05:43:43,549 [INFO] page_rank.py:26 - ====== start pagerank =======
Segmentation fault
Switching to another dataset (twitter-2010) was successful.
The stdout log
2022-08-11 07:30:27,843 [INFO] topoanalysis:igraph/page_rank.py:18 - begin load graph...
2022-08-11 07:55:54,097 [INFO] topoanalysis:igraph/page_rank.py:22 - vertex count 41652230, edge count 1468365182
2022-08-11 07:55:54,098 [INFO] topoanalysis:igraph/page_rank.py:23 - finied load graph, cost: 1526.26
2022-08-11 07:55:54,098 [INFO] topoanalysis:igraph/page_rank.py:24 - build graph cost: 1526.26
2022-08-11 07:55:54,098 [INFO] topoanalysis:igraph/page_rank.py:26 - ====== start pagerank =======
2022-08-11 08:40:09,094 [INFO] topoanalysis:igraph/page_rank.py:31 - res count: 41652230
2022-08-11 08:40:09,094 [INFO] topoanalysis:igraph/page_rank.py:33 - [7.475319044789155e-09, 4.237505003125664e-09, 7.1960692862092996e-09, 7.1960692862092996e-09, 7.196069286209607e-09, 7.196069286209607e-09, 3.828410197182684e-08, 2.4621186119459063e-07, 1.0186054720454494e-07, 1.0186054720454402e-07]
2022-08-11 08:40:09,094 [INFO] topoanalysis:igraph/page_rank.py:35 - algorithm cost: 2655.0
The vertex count and edge count of twitter-2010 are both less than $2^{31}$.
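A quick check of the reported counts from the two runs against the signed 32-bit limit:

```python
LIMIT = 2**31  # 2147483648: the largest count a signed 32-bit index allows

# twitter-2010 (succeeded): both counts fit
print(41652230 < LIMIT, 1468365182 < LIMIT)    # True True
# datagen-scale_1X (crashed): vertex count fits, edge count does not
print(434943376 < LIMIT, 2658026604 < LIMIT)   # True False
```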
I rebuilt the python-igraph development version using igraph-4d2a03c30d303f6e6b9a64c7bf0c334b856c3c8c. Does this igraph version work?
Yes, that should work.
I'm thinking about two possibilities but I don't know yet whether any of these is relevant or not:
- ARPACK may or may not support matrices with that many edges; I find the vendored ARPACK in igraph suspicious because it uses `integer` as the type all along the f2c-translated code, and `integer` is typedef'd to an `int`. So, if ARPACK uses an `integer` anywhere to represent edge indices (although I'm not sure whether it really does - it is not supposed to work with the edges directly), that might be problematic.
- Python might have problems allocating a list of length equal to the number of vertices, so the error might be not in igraph's calculation but at the point where we translate igraph's result back into a Python list. Since we are not using NumPy matrices, the translation is probably hard enough for Python because it needs to allocate one huge list and one additional `float` object for each list element (instead of just returning a NumPy array, which does not allocate float objects until they are actually retrieved from the list).
I find 1 more likely than 2, but I need to investigate further.
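A rough back-of-the-envelope estimate of the allocation involved in possibility 2 (CPython object sizes vary by build; the 8-byte list slot is an assumption for a 64-bit interpreter):

```python
import sys

n = 434_943_376                        # vertex count from the failing run
# each result element costs one float object plus one 8-byte list pointer
per_item = sys.getsizeof(0.12345) + 8
print(f"~{per_item * n / 2**30:.1f} GiB just for the result list")
```

Large, but well within the 320 GB on the reporter's machine, which is one reason possibility 1 looks more likely.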
@jmzhoulab Any chance to compile python-igraph in debug mode and then run the PageRank calculation through gdb or lldb to get a stack trace of the crash? Simply replace the two occurrences of Release (note the capital R) with Debug in setup.py, remove the build and dist folders, and rebuild with python setup.py install. Then attach a debugger to the Python process before it starts the PageRank calculation and get a stack trace of the crash.
I don't think (1) is possible because ARPACK never becomes aware of edges. We simply pass ARPACK a function that takes a vector and computes the action of the PageRank matrix on that vector. ARPACK does not need to represent that matrix—it just needs a function to compute its action.
Am I missing something @ntamas ?
That's true, but I cannot think of any other reason; that's why it would be good to see a stack trace.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
I have not been able to test this as I simply don't have access to a machine that has enough memory.
If you can build and test the latest version with AddressSanitizer, that would be immensely useful @jmzhoulab. Instructions for building and running python-igraph with AddressSanitizer are here: https://github.com/igraph/igraph/wiki/Using-sanitizers-to-find-bugs#python-igraph
Alternatively, if you're willing to test a piece of C code instead, that would be useful as well. Let me know and I can help port your Python test program to C and provide instructions for compiling/testing it. This might actually be easier than dealing with Python.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.