write_graph() and read_graph() should detect format from file name

Open szhorvat opened this issue 4 years ago • 21 comments

What is the feature or improvement you would like to see?

write_graph() and read_graph() use the default format, edgelist, when a format is not specified. This means that write_graph(g, "foo.graphml") will write the graph in edgelist format, not in GraphML.

This is a request to try to detect the format from the file name automatically whenever the format is not explicitly given.

Use cases for the feature

It would make using these functions easier and more intuitive. It is reasonable to expect write_graph(g, "foo.graphml") to write GraphML.

szhorvat avatar Feb 10 '22 08:02 szhorvat

@ntamas Not to annoy too much, but this would be a good candidate for 1.3.0 as well (since it's a change of behaviour that shouldn't go into 1.3.1), and low-hanging fruit.

szhorvat avatar Mar 28 '22 15:03 szhorvat

Sorry, it's probably not going to happen for 1.3.0; I am still wrestling with compilation issues in certain R-hub configurations and I also need to start re-running revdepchecks soon; plus we need to get this version out of the door by March 30 so I'd rather not spend any more time on issues that are non-essential for 1.3.0.

ntamas avatar Mar 28 '22 17:03 ntamas

We have the following formats:

  • pajek
  • edgelist
  • ncol
  • lgl
  • graphml
  • dimacs
  • gml
  • dot
  • leda

What are typical file extensions for those formats? I see .gml.gz and .graphml.gz for GML and GraphML, respectively, and perhaps .dot is used for GraphViz. Are there more?

krlmlr avatar Dec 14 '22 15:12 krlmlr

Pajek is typically .net. Edgelist files can be pretty arbitrary; it is to be debated whether we should automatically attempt to parse .txt files as edgelist or not. NCOL and LGL are .ncol and .lgl as far as I know. For GraphML, .graphml is the most common, although these are often distributed as gzipped files due to their size, so if we can handle .graphml.gz and decompress on-the-fly, that would be great. Let's ignore DIMACS (unless @szhorvat has a suggestion). You can use .gml for GML (the gzip comment also applies here) and .dot for GraphViz. LEDA uses .gw or .lgr.
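
As a rough illustration of the mapping described above, here is a sketch in Python (chosen only as neutral pseudocode; this is not the R implementation). The names `EXTENSION_TO_FORMAT` and `detect_format` are hypothetical, and case-insensitive matching is an assumption. `.txt` is deliberately left out because whether it should mean edgelist is still open, and DIMACS is ignored.

```python
import os

# Hypothetical table built from the extensions listed in this comment.
EXTENSION_TO_FORMAT = {
    ".net": "pajek",
    ".ncol": "ncol",
    ".lgl": "lgl",
    ".graphml": "graphml",
    ".gml": "gml",
    ".dot": "dot",
    ".gw": "leda",
    ".lgr": "leda",
}


def detect_format(filename):
    """Return (format, is_gzipped); format is None if the extension is unknown."""
    name = filename.lower()  # assumption: extensions are matched case-insensitively
    gzipped = name.endswith(".gz")
    if gzipped:
        name = name[:-3]  # strip the trailing ".gz" before looking at the real extension
    ext = os.path.splitext(name)[1]
    return EXTENSION_TO_FORMAT.get(ext), gzipped


print(detect_format("foo.graphml"))    # ('graphml', False)
print(detect_format("lesmis.gml.gz"))  # ('gml', True)
print(detect_format("edges.txt"))      # (None, False)
```

Returning `None` for unknown extensions leaves the caller free to fall back to a default format or to raise an error.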

ntamas avatar Dec 14 '22 15:12 ntamas

Nice! Do you have example files that we could add here or to igraphdata?

krlmlr avatar Dec 14 '22 15:12 krlmlr

tl;dr I recommend skipping everything except .net (Pajek), .gml and .graphml.

We may discuss if we want to have a default interpretation when the format couldn't be determined, and if NCOL (effectively a named edgelist) is a good choice for that.


Some more notes in addition to what Tamás said:

Pajek has many subformats, but only .net is supported. Do not recognize the other extensions.

I suggest skipping the edgelist format. This format is dangerous in that if a single large number occurs, it will cause a huge memory allocation.
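
To make the concern concrete: in an edgelist file, vertices are plain integer IDs, so a reader has to assume the graph has at least (largest ID + 1) vertices. The Python sketch below only computes that number and allocates nothing; `implied_vertex_count` is a hypothetical helper, and zero-based IDs are an assumption.

```python
def implied_vertex_count(text):
    """Number of vertices a reader must create for a zero-based edge list."""
    ids = [int(token) for token in text.split()]
    return max(ids) + 1 if ids else 0


small = "0 1\n1 2\n2 0\n"
stray = "0 1\n1 2\n2 1000000000\n"  # one stray large number

print(implied_vertex_count(small))  # 3
print(implied_vertex_count(stray))  # 1000000001
```

A file that is not an edge list at all, but happens to contain one large number, would trigger the same blow-up if it were parsed as one by default.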

It might be good to also skip LGL and NCOL as these are not common exchange formats. They're special formats for the now defunct LGL software. NCOL happens to be useful to read named edge lists, but you won't find these with an .ncol extension around the net. That said, .ncol and .lgl are sufficiently uncommon as extensions that it's not harmful to include them.

There are many variations on the DIMACS format and igraph only supports one. It's not standardized. The extension is even less standardized. The format used for graph colouring often has a .col extension and Mathematica actually recognizes these as DIMACS. But igraph does not yet support the colouring format. It implements the max flow format. See https://github.com/igraph/igraph/issues/1924. I agree with skipping it.

igraph does not read GraphViz DOT files or LEDA files, so they're not relevant here.

UCINET DL was not mentioned. The extension I've seen was .dat, which is maybe too generic. I suggest skipping.

szhorvat avatar Dec 14 '22 15:12 szhorvat

For reference, this is what other software does:

  • Mathematica: https://reference.wolfram.com/language/guide/MathematicalDataFormats.html Mathematica will auto-gunzip anything with .gz so a trailing .gz is always allowed.
  • Maple: https://de.maplesoft.com/support/help/Maple/view.aspx?path=Formats&cid=562
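
The "trailing .gz is always allowed" behaviour could look roughly like the following Python sketch. It decides purely by file name, which is an assumption (checking the gzip magic bytes would be another option), and `open_maybe_gzipped` is a hypothetical helper name.

```python
import gzip
import os
import tempfile


def open_maybe_gzipped(path):
    """Open path for reading text, decompressing on the fly if it ends in .gz."""
    if path.lower().endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


# Round trip through a temporary file to show that callers see plain text.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.gml.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write("graph [ node [ id 0 ] ]\n")
    with open_maybe_gzipped(path) as handle:
        content = handle.read()

print(content)
```

The format reader itself never needs to know whether the file was compressed; it only sees the text stream.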

szhorvat avatar Dec 14 '22 15:12 szhorvat

> Nice! Do you have example files that we could add here or to igraphdata?

  • GML: http://www-personal.umich.edu/~mejn/netdata/
  • GraphML: Export some with igraph.
  • Pajek: http://vlado.fmf.uni-lj.si/pub/networks/data/ Also has some UCINET DL in there.

szhorvat avatar Dec 14 '22 15:12 szhorvat

The site linked above for GML has a login prompt; I guess it's not available any more. So what's left for us is to take one of the "famous" graphs that are already bundled in igraph via the igraph_famous() function, dump that to GML and GraphML, and add it to igraphdata so we can easily access it from unit tests. (@krlmlr alternatively, can these files be bundled with igraph itself?).

ntamas avatar Dec 15 '22 12:12 ntamas

@ntamas You probably have some extension that rewrites http:// to https://. Make sure you use http://, exactly as I wrote it, and there's no login prompt.

szhorvat avatar Dec 15 '22 15:12 szhorvat

Also, we have extensive tests for format readers in the C core, primarily through OSS-fuzz. You can check what the fuzzer build script copies into the fuzzer corpus and pick something from there if you like:

https://github.com/igraph/igraph/blob/master/fuzzing/build.sh

Note that some of these test files are corrupted (on purpose).

szhorvat avatar Dec 15 '22 15:12 szhorvat

celegansneural.zip

This is the C. elegans neural network from Mark Newman's page, converted into GML, GraphML and Pajek. This should be suitable for testing purposes.

ntamas avatar Dec 15 '22 17:12 ntamas

Perhaps the Les Miserables network is a better example. Just like the C. elegans neural network, it has both vertex attributes (names) and edge attributes (weights). However, the vertex names don't look like numbers, so it's less confusing. People regularly confuse vertex names with vertex IDs.

szhorvat avatar Dec 15 '22 19:12 szhorvat

lesmis.zip

This is the Les Miserables network in GML, GraphML and Pajek.

ntamas avatar Dec 17 '22 10:12 ntamas

The Pajek file does not contain the correct names.

szhorvat avatar Dec 20 '22 08:12 szhorvat

igraph's Pajek writer exports only the value of the id attribute (not the name attribute) as the vertex name. This is despite the reader creating both a name and an id attribute. This should be corrected in the C core (can someone open an issue?)

Attached is a corrected file.

lesmis.net.gz

Furthermore, the creator field in the GML file does not look right.

szhorvat avatar Dec 20 '22 08:12 szhorvat

Thanks for the fix; indeed this should be fixed in the C core then. I was simply saving the Pajek file from igraph itself but didn't check the output.

As for the creator field, let me check that; it might be a consequence of me trying to mess with the quoting of PACKAGE_VERSION on the command line when compiling the R interface.

ntamas avatar Dec 20 '22 08:12 ntamas

Just remove the creator line altogether with a text editor. It's not necessary.

szhorvat avatar Dec 20 '22 08:12 szhorvat