Visualizing Wikipedia
https://wikiscape.org (currently offline, big picture here: https://ibb.co/mJN3DM5) https://github.com/void4/wikiscape
I really like https://paperscape.org
So far, I've tried three times to use their code to lay out and graph the entire Wikipedia.
There are several issues:
The paperscape guys have an incrementally updateable database. With Wikipedia, I only have a large xml.bz2 dump. I wrote scripts to:
- decompress the 18GB .bz2 in chunks of 20000 pages each (999 chunks, ~76GB total)
- extract the links and write them into .csv adjacency lists (with multiprocessing this takes about 2 hours; see the sketch after this list)
- combine the .csv files into one (~9GB)
- rewrite that file with short ids, and store the shortid:full name map in a separate file (also sketched below)
- process the short-id file (say, remove pages with an indegree less than x), then translate the short ids back to the full names and clean them (removing category links, image links, and characters that the graph programs or libraries can't process)
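Roughly, the link-extraction step looks like the sketch below. The chunk filenames, the regexes and the CSV layout are placeholders here, not the actual scripts.

```python
import csv
import re
import glob
from multiprocessing import Pool

# Wiki links look like [[Target]] or [[Target|label]]; keep only the target.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")

# Very simplified page splitting; the real dump needs proper XML handling.
PAGE_RE = re.compile(r"<title>(.*?)</title>.*?<text[^>]*>(.*?)</text>", re.S)

def extract_chunk(path):
    """Read one decompressed chunk and write a CSV adjacency list:
    source_title, target1, target2, ..."""
    out_path = path + ".links.csv"
    with open(path, encoding="utf-8") as f, \
         open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for title, text in PAGE_RE.findall(f.read()):
            targets = sorted({t.strip() for t in LINK_RE.findall(text)})
            writer.writerow([title] + targets)
    return out_path

if __name__ == "__main__":
    # Process all chunks in parallel (the ~2 hours mentioned above).
    with Pool() as pool:
        for done in pool.imap_unordered(extract_chunk, glob.glob("chunks/chunk_*.xml")):
            print("wrote", done)
```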
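And the short-id step, again only as a sketch with assumed file names:

```python
import csv

# Replace every title in the combined adjacency list with a small integer
# and save the id -> title map for the later translation step.
ids = {}

def short_id(title):
    return ids.setdefault(title, len(ids))

with open("combined_links.csv", encoding="utf-8") as src, \
     open("links_shortids.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([short_id(title) for title in row])

with open("id_to_title.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for title, i in ids.items():
        writer.writerow([i, title])
```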
While the paperscape nbody layout algorithm seems to be well suited to laying out highly connected graphs, the C/Go layout, tiling and webserver software mainly revolves around a MySQL database with a schema that is specific to arXiv.org papers. The tools can use .json files as inputs and outputs, but these are pretty verbose.
Unlike the paperscape dataset, Wikipedia articles often contain information with geographical relevance, so the "abstract" topic (say, football) and the geographic information (e.g. United States) are linked very closely together. Scientific papers are much more international.
When trying to do the layout with https://gephi.org/ instead, it is necessary to reduce the number of edges to make the layout computationally tractable. Which edges should be used? Those with the lowest total number of references (above some minimum)? This would ensure some degree of "locality".
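A minimal sketch of that heuristic (the thresholds and file names are made up): count references per page in one pass, then keep only edges whose endpoints are rarely referenced overall.

```python
import csv
from collections import Counter

MIN_REFS = 5          # assumed minimum reference count
MAX_EDGE_SCORE = 200  # assumed cutoff on the combined reference count

# First pass: count incoming references per page.
refs = Counter()
with open("links_shortids.csv", encoding="utf-8") as f:
    for row in csv.reader(f):
        refs.update(row[1:])

# Second pass: keep only edges that pass the "locality" filter, i.e. edges
# whose endpoints are not huge hubs but are still referenced often enough.
with open("links_shortids.csv", encoding="utf-8") as f, \
     open("links_reduced.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        src, targets = row[0], row[1:]
        for dst in targets:
            if refs[dst] >= MIN_REFS and refs[src] + refs[dst] <= MAX_EDGE_SCORE:
                writer.writerow([src, dst])
```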
Besides paperscape-nbody and Gephi, I tried node2vec, though with unsatisfying results.
Next up: https://github.com/lferry007/LargeVis
Promising results!
When limiting the enwiki-20200201-pages-articles-multistream.xml.bz2 dump to those pages which are referred to by at least 5 other pages, we get a graph with ~6.6M vertices and 144M edges.
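A rough sketch of how that filter and the resulting counts could be computed (assumed file names again, and assuming pages with too few incoming links are dropped together with their outgoing links):

```python
import csv
from collections import Counter

MIN_INDEGREE = 5  # keep only pages referred to by at least 5 other pages

# Count incoming references per page.
indegree = Counter()
with open("links_shortids.csv", encoding="utf-8") as f:
    for row in csv.reader(f):
        for dst in set(row[1:]):
            indegree[dst] += 1

kept = {v for v, d in indegree.items() if d >= MIN_INDEGREE}

# Count the edges that survive between kept pages.
n_edges = 0
with open("links_shortids.csv", encoding="utf-8") as f:
    for row in csv.reader(f):
        src = row[0]
        if src not in kept:
            continue
        n_edges += sum(1 for dst in set(row[1:]) if dst in kept)

print(f"{len(kept)} vertices, {n_edges} edges")
```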
For future reference, it might have been faster to use https://www.mediawiki.org/wiki/Manual:Pagelinks_table
A picture from my second time trying paperscape-nbody:
For the wikiscape mouse-click lookups I use a k-d tree to make the nearest-point search more efficient: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html
Tiles up to a certain depth (currently 12) are generated and served as static images; any tiles deeper than that are generated dynamically from the k-d tree.
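The lookup itself is just a nearest-neighbour query against the layout coordinates; a minimal sketch with scipy, with assumed file and variable names:

```python
import numpy as np
from scipy.spatial import cKDTree

# Build the tree once over the laid-out point positions, then map a clicked
# map coordinate to the nearest page. Not the wikiscape code, just a sketch.
positions = np.load("layout_xy.npy")   # assumed shape (n_pages, 2)
titles = open("titles.txt", encoding="utf-8").read().splitlines()

tree = cKDTree(positions)

def page_at(x, y):
    """Return the title of the page closest to the clicked coordinate."""
    _, idx = tree.query([x, y], k=1)
    return titles[idx]

print(page_at(12.3, -4.5))
```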