DGL Operator
The DGL Operator makes it easy to run distributed or non-distributed Deep Graph Library (DGL) graph neural network training on Kubernetes. For an introduction to DGL and the DGL distributed training philosophy, see the DGL documentation.
🛠 Prerequisites
- Kubernetes >= 1.16
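The version requirement above can be checked with a small helper. This is only a sketch: k8s_version_ok is a hypothetical function name, not part of the operator, and it assumes you pass it the server version string reported by kubectl version (for example v1.22.3).

```shell
# Hypothetical helper: succeeds if a Kubernetes version string is >= 1.16.
# Accepts forms such as "v1.16.0", "1.22", or "v1.22.3-gke.100".
k8s_version_ok() {
  _ver=${1#v}                   # drop an optional leading "v"
  _major=${_ver%%.*}            # text before the first dot
  _rest=${_ver#*.}              # text after the first dot
  _minor=${_rest%%[!0-9]*}      # leading digits of the minor component
  # Reject anything that is not a plain number.
  case "$_major" in ''|*[!0-9]*) return 2 ;; esac
  case "$_minor" in ''|*[!0-9]*) return 2 ;; esac
  [ "$_major" -gt 1 ] || { [ "$_major" -eq 1 ] && [ "$_minor" -ge 16 ]; }
}

# Usage:
#   k8s_version_ok v1.22.3 && echo "cluster version is supported"
```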
🚀 Installation
You can deploy the operator with default settings by running the following commands:
git clone https://github.com/Qihoo360/dgl-operator
cd dgl-operator
kubectl create -f deploy/v1alpha1/dgl-operator.yaml
You can check whether the DGLJob custom resource definition (CRD) is installed with:
kubectl get crd
The output should include dgljobs.qihoo.net, like the following:
NAME AGE
...
dgljobs.qihoo.net 1m
...
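For scripted setups, the same check can be done without reading the full list. The sketch below asks kubectl for the CRD by name and relies on its exit status. The function name dgljob_crd_installed is hypothetical; the CRD name dgljobs.qihoo.net is the one shown in the output above.

```shell
# Hypothetical helper: succeeds only if the DGLJob CRD exists in the cluster.
# "kubectl get crd NAME" exits non-zero when the named CRD is not found.
dgljob_crd_installed() {
  kubectl get crd dgljobs.qihoo.net >/dev/null 2>&1
}

# Usage:
#   if dgljob_crd_installed; then
#     echo "DGLJob CRD is installed"
#   else
#     echo "DGLJob CRD is missing; re-run the installation step" >&2
#   fi
```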
🔬 Creating a DGL Job
You can create a DGL job by defining a DGLJob config file. See the example config files GraphSAGE.yaml and GraphSAGE_dist.yaml for launching a single-node or multi-node GraphSAGE training job, respectively. You can modify these files to suit your requirements.
# standalone GraphSAGE
cat examples/v1alpha1/GraphSAGE.yaml
# or a distributed version
cat examples/v1alpha1/GraphSAGE_dist.yaml
Deploy the DGLJob resource to start training:
# standalone GraphSAGE
kubectl create -f examples/v1alpha1/GraphSAGE.yaml
# or a distributed version
kubectl create -f examples/v1alpha1/GraphSAGE_dist.yaml
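The two variants can be wrapped in one helper that submits the chosen example and then lists the DGLJob resources. This is a minimal sketch, not part of the operator: submit_graphsage is a hypothetical name, and it assumes you run it from the root of the cloned repository so the example paths resolve. The resource name dgljobs.qihoo.net is taken from the CRD installed earlier.

```shell
# Hypothetical helper: create one of the GraphSAGE example jobs, then list DGLJobs.
# Argument: "standalone" or "distributed".
submit_graphsage() {
  case "$1" in
    standalone)  _file=examples/v1alpha1/GraphSAGE.yaml ;;
    distributed) _file=examples/v1alpha1/GraphSAGE_dist.yaml ;;
    *)
      echo "usage: submit_graphsage standalone|distributed" >&2
      return 2
      ;;
  esac
  # Only list the jobs if the create step succeeded.
  kubectl create -f "$_file" && kubectl get dgljobs.qihoo.net
}

# Usage:
#   submit_graphsage distributed
```

Note that kubectl create reports an error if a resource with the same name already exists, so delete the earlier job before resubmitting the same example.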
💭 Reference
The following projects helped inspire the creation of the DGL Operator:
- PaddleFlow/paddle-operator - Elastic Deep Learning Training based on Kubernetes by Leveraging EDL and Volcano.
- kubeflow/mpi-operator - Kubernetes Operator for Allreduce-style Distributed Training.