gapbs
[converter] allow '#' as comment character in edge list parsing
Many people download graphs from the Stanford SNAP database. These are edge list files (.el), but they contain some comments as a header. Manually splicing out the first four lines of a 32GB file is painful, so here is a quick way to allow converter to work on the file without modification.
Sample test and output:
$ cat > foo.el
# Undirected graph: ../../data/output/friendster.txt
# Friendster
# Nodes: 65608366 Edges: 1806067135
# FromNodeId ToNodeId
101 102
121 104
131 107
141 125
101 165
101 168
151 170
101 176
161 180
101 181
191 182
102 209
103 210
101 248
101 306
104 329
105 330
106 340
^D
$ ./converter -f foo.el -b foo.sg
# ignoring comment
# ignoring comment
# ignoring comment
# ignoring comment
Read Time: 0.00448
Build Time: 0.00343
Graph has 341 nodes and 18 directed edges for degree: 0
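For readers without the diff at hand, the idea can be sketched roughly as below. This is a minimal, standalone illustration and not the actual patch: the `Edge` alias and the function name `ReadEdgeListSkippingComments` are hypothetical and are not GAPBS identifiers, and unlike the sample run above it skips comment lines silently instead of printing a message.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an edge type; this is not the GAPBS edge type.
using Edge = std::pair<std::int64_t, std::int64_t>;

// Reads one "u v" pair per line, skipping blank lines and any line whose
// first non-whitespace character is '#'.
std::vector<Edge> ReadEdgeListSkippingComments(std::istream &in) {
  std::vector<Edge> edges;
  std::string line;
  while (std::getline(in, line)) {
    std::size_t first = line.find_first_not_of(" \t\r");
    if (first == std::string::npos || line[first] == '#')
      continue;  // blank line or comment line
    std::istringstream fields(line);
    std::int64_t u, v;
    if (fields >> u >> v)  // lines that do not parse as two integers are dropped
      edges.emplace_back(u, v);
  }
  return edges;
}
```

Passing a `std::ifstream` opened on a file like foo.el above would return only the 18 numeric pairs.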
It's been a few months. Just checking, is this worthy of merging to mainline? Thanks!
Sorry for the delay!
Thank you for the PR!
Although SNAP is commonly used, this change adds complexity. Gapbs takes the primitive approach of using file suffixes to identify file types, and we currently don't go near .txt (commonly used on SNAP) since it could mean so many things.
For cases like this, I recommend filtering out those comment lines:
grep -v '^#' WikiVote.txt > WikiVote.el
That is unfortunate, since the example grep command would take a long time for large graphs. Instead, with the patch above the user could just run...
mv WikiVote.txt WikiVote.el
...and have it be ready to process.
With pipes, the grep command takes no extra time. The file is still read once, and the process is still IO-bound.