stellar-random-walk
                        node2vec
A Spark + Scala implementation of neighborhood sampling and feature learning for large graphs. This repository includes an implementation of node2vec.
Features
- Scalable node2vec
- Second-order random walk (see the sketch after this list)
- Memory-efficient graph data structure, introduced at Spark Summit 2017 in "Random Walks on Large Scale Graphs with Apache Spark" by Min Shen (LinkedIn)
- Leverages graph partition information to reduce communication and speed up the random walk computation
- Compatible with Spark-JobServer
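To make the second-order walk concrete, here is a minimal sketch (not the repository's code) of node2vec's biased transition weight; weight and neighbors are hypothetical helpers standing in for the underlying graph structure:

    // Unnormalized probability of stepping from `curr` to `next`,
    // given that the walk arrived at `curr` from `prev`.
    def transitionWeight(prev: Int, curr: Int, next: Int,
                         p: Double, q: Double,
                         weight: (Int, Int) => Double,
                         neighbors: Int => Set[Int]): Double = {
      val w = weight(curr, next)
      if (next == prev) w / p                    // return to the previous vertex
      else if (neighbors(prev).contains(next)) w // stay at distance 1 from prev
      else w / q                                 // move outward
    }

A small p encourages the walk to return to the previous vertex and stay local, while a small q pushes it toward unexplored parts of the graph.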
Requirements
- Scala 2.11 or later
- Maven 3+
- Java 8+
- Apache Spark 2.2.0 or later
- Optional: Spark-JobServer 0.8.0 or later
Quick Setup
- Get the random walk application source code:
  - using git: git clone git@github.com:data61/stellar-random-walk.git
  - using HTTP: download the zip file and unzip it.
- Go to the source code directory. A pre-built jar file, named randomwalk-0.0.1-SNAPSHOT.jar, is available at ./target; use this jar file to run the application. (If you want to build the jar from the source code instead, install Apache Maven and run: mvn clean package)
- Download Apache Spark 2.2.0 or later (e.g., release 2.2.1, pre-built for Apache Hadoop 2.7).
Run the Application Using Spark (local machine)
To run the application on your machine, use the spark-submit script. Go to the Apache Spark directory and run the application with the following command:

    bin/spark-submit --class au.csiro.data61.randomwalk.Main [random walk dir]/target/randomwalk-0.0.1-SNAPSHOT.jar
Run the Application Using Spark Job-server
- Make sure that the prerequisites are installed:
  - Apache Spark (e.g., release 2.2.1, pre-built for Apache Hadoop 2.7)
  - Java Virtual Machine (e.g., 9.0.1)
  - sbt
- Clone the Spark Job-server repository: git clone https://github.com/spark-jobserver/spark-jobserver.git
- Create a .bashrc with the paths to the JVM, sbt, and Spark. For example, for macOS users:

    export SBT=/usr/local/Cellar/sbt/1.1.0
    export SPARK_HOME=~/spark-2.2.1-bin-hadoop2.7
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-9.0.1.jdk/Contents/Home
    export PATH=$JAVA_HOME/bin:$SBT/bin:$PATH

- Run: source .bashrc
- Go to the Spark folder and run start-all.sh:

    cd spark-2.2.1-bin-hadoop2.7/
    sbin/start-all.sh

- From the Spark Job-server directory, run the sbt shell:

    cd spark-jobserver/job-server/
    sbt

- Once inside the sbt shell, run reStart.
- Go to http://localhost:8090/ and make sure that the Spark Job Server UI is working. (Note: in Chrome the binaries were not updated properly, while in Firefox it worked fine.)
- Upload the randomwalk jar to the server:

    curl --data-binary @randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar localhost:8090/jars/randomwalk

- Submit a job:

    curl -d "rw.input = --cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 --directed false --input [random walk dir]/src/test/resources/karate.txt --output [output dir] --partitioned false" 'localhost:8090/jobs?appName=randomwalk&classPath=au.csiro.data61.randomwalk.Main'

- Check the job status in the Spark Job-server UI, or query the REST API directly as in the sketch below.
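A minimal Scala sketch of querying the Job-server REST API (an alternative to the UI, not part of this repository; assumes the server runs on localhost:8090):

    import scala.io.Source

    // GET /jobs returns a JSON array of recent jobs, each entry
    // carrying a jobId, a status, and the context name.
    val jobsJson = Source.fromURL("http://localhost:8090/jobs").mkString
    println(jobsJson)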
Application Options
The following options are available (default values shown after the colon):
   --walkLength <value>     walkLength: 80
   --numWalks <value>       numWalks: 10
   --p <value>              return parameter p: 1.0
   --q <value>              in-out parameter q: 1.0
   --rddPartitions <value>  Number of RDD partitions in running Random Walk and Word2vec: 200
   --weighted <value>       weighted: true
   --directed <value>       directed: false
   --w2vPartitions <value>  Number of partitions in word2vec: 10
   --input <value>          Input edge-file/paths-file: empty
   --output <value>         Output path: empty
   --singleOutput <value>   Whether to write the output in a single file: true
   --cmd <value>            command: randomwalk, embedding, or node2vec (randomwalk + embedding)
   --partitioned <value>    Whether the graph is partitioned: false
   --lr <value>             Learning rate in word2vec: 0.025
   --iter <value>           Number of iterations in word2vec: 10
   --dim <value>            Number of dimensions in word2vec: 128
   --window <value>         Window size in word2vec: 10
For example:
   bin/spark-submit --class au.csiro.data61.randomwalk.Main ./randomwalk/target/randomwalk-0.0.1-SNAPSHOT.jar \
   --cmd randomwalk --numWalks 1 --p 1 --q 1 --walkLength 10 --rddPartitions 10 \
   --input [input edge list] --output [output directory] --partitioned false
You can choose which algorithm to run with the --cmd option:
- randomwalk: run just the second-order random walk.
- embedding: run just word2vec, given paths as input.
- node2vec: run randomwalk followed by embedding.
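To see how the word2vec options map onto the embedding step, here is a minimal sketch built on Spark MLlib's Word2Vec (a stand-in for illustration, not this repository's implementation). It assumes an existing SparkContext sc and walks stored as tab-separated vertex ids, as described under Output Format below:

    import org.apache.spark.mllib.feature.Word2Vec

    // Each random walk becomes a "sentence" whose words are vertex ids.
    val walks = sc.textFile("[output dir]/path")
      .map(line => line.split("\t").toSeq)

    val model = new Word2Vec()
      .setVectorSize(128)      // --dim
      .setLearningRate(0.025)  // --lr
      .setNumIterations(10)    // --iter
      .setWindowSize(10)       // --window
      .setNumPartitions(10)    // --w2vPartitions
      .fit(walks)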
Graph Input File Format
The input graph must be an edge list with integer vertex IDs. For example:
src1-id dst1-id
src1-id dst2-id
... 
If the graph is weighted, it must include the weight in the last column for each edge. For example:
src1-id dst1-id 1.0
If the graph is partitioned, each edge must be assigned a partition number, given in the third column of the edge list. For example:
src1-id dst1-id 1 1.0
src1-id dst2-id 1 1.0
src3-id dst1-id 2 1.0
... 
The application itself replicates (cuts) those vertices that span multiple partitions.
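As a conceptual sketch (not the repository's code), the cut vertices can be found by collecting the set of partitions in which each endpoint appears:

    // Edges as (src, dst, partition, weight). Vertex 2 appears in
    // partitions 1 and 2, so it spans a cut and would be replicated.
    val edges = Seq((1, 2, 1, 1.0), (1, 3, 1, 1.0), (4, 2, 2, 1.0))

    val partitionsOfVertex: Map[Int, Set[Int]] = edges
      .flatMap { case (src, dst, part, _) => Seq(src -> part, dst -> part) }
      .groupBy(_._1)
      .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

    // Only vertex 2 belongs to more than one partition here.
    val cutVertices = partitionsOfVertex.collect {
      case (v, parts) if parts.size > 1 => v
    }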
Output Format
The application writes output to the disk in the directory given by the --output parameter. In this directory, three folders may be created:
- /path: the result for randomwalk.
- /vec: vector representation of vertices.
- /bin: word2vec model's metadata.
The output of the random walk (in the /path directory) consists of tab-separated integers (vertex ids), where each line represents the random steps starting from a vertex. The number of lines equals the number of generated paths (--numWalks * |V|). The result is written as plain text, partitioned into --rddPartitions files unless you set --singleOutput to true.
The embeddings are also tab-separated numbers, one vertex per line: the first number is the vertex id and the remaining numbers are the vertex's vector representation. The result is likewise written as plain text, partitioned into --rddPartitions files unless you set --singleOutput to true.
An example of randomwalk output:
4	2	4	2	4
32	26	32	33	16
20	2	31	2	31
...
An example of vector representations:
34	-0.055403516	-0.007816158	0.053844377	-0.038890388	0.13484207	-0.07744342
12	-0.025323598	0.054739047	-0.077121474	-0.016575111	-0.041371193	-0.046176624
8	-0.07687951	0.120588094	-0.11744399	-0.058044944	-0.061426725	-0.13111477
...
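A small sketch (not part of this repository) of loading such an embedding file back into memory; the part-00000 file name assumes Spark's default plain-text output naming:

    import scala.io.Source

    // Each line: vertexId<TAB>x1<TAB>x2<TAB>...
    val embeddings: Map[Int, Array[Double]] =
      Source.fromFile("[output dir]/vec/part-00000").getLines()
        .map { line =>
          val fields = line.split("\t")
          fields.head.toInt -> fields.tail.map(_.toDouble)
        }.toMap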
References
- Grover, Aditya, and Jure Leskovec. "node2vec: Scalable feature learning for networks." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2016.
- Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
- Shen, Min. "Random Walks on Large Scale Graphs with Apache Spark." Spark Summit 2017.