FeatGraph
Using end-to-end DGL scripts to run FeatGraph
Hi, I want to run FeatGraph end-to-end. I have already built DGL (with FeatGraph) and run the test.py file successfully using the instructions posted in https://github.com/dmlc/dgl/tree/master/featgraph.
- If I want to run end-to-end GCN training on the Pubmed or Reddit dataset, can I just use the DGL GCN benchmark script I already have without changing any kernel names? In other words, which parts of the DGL Python script do I need to change so that I can run FeatGraph (not DGL) end-to-end? Thank you.
You might check out this branch of DGL:
https://github.com/kira-lin/dgl/tree/tvm_integration
Thanks for your reply. I just clarified my question by re-editing the post above. Can you respond again? Thank you.
I used the DGL test scripts to run GCN on the PubMed and Cora datasets with one extra line of code: dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so")
The Python script runs without any error, but the training time with FeatGraph is the same as with plain DGL. It seems FeatGraph does not improve training time at all.
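For reference, the load call sits at the top of the training script, roughly like this (a minimal sketch, assuming DGL's built-in PubmedGraphDataset and GraphConv; the hidden size and training loop are placeholders, and the .so path is relative to wherever the script is launched):

import torch
import torch.nn.functional as F
import dgl
from dgl.nn import GraphConv

# Load the FeatGraph kernel module before any message passing runs.
dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so")

dataset = dgl.data.PubmedGraphDataset()
g = dgl.add_self_loop(dataset[0]).to('cuda')
feat, label = g.ndata['feat'], g.ndata['label']
train_mask = g.ndata['train_mask']

conv1 = GraphConv(feat.shape[1], 16).to('cuda')
conv2 = GraphConv(16, dataset.num_classes).to('cuda')
opt = torch.optim.Adam(list(conv1.parameters()) + list(conv2.parameters()), lr=1e-2)

for epoch in range(200):
    h = torch.relu(conv1(g, feat))
    logits = conv2(g, h)
    loss = F.cross_entropy(logits[train_mask], label[train_mask])
    opt.zero_grad()
    loss.backward()
    opt.step()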
I don't think FeatGraph outperforms cuSPARSE for GCN on GPU (see Table IV in the paper). Since DGL uses cuSPARSE, it's expected that you don't observe any acceleration here.
Thank you very much for your response. I am closing this issue.
Sorry, I just noticed that you were using dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so")
to use FeatGraph as the backend. That integration was actually abandoned because TVM does not have native sparse support and we might encounter several issues in production, so in most cases you will still be using DGL's native backend even if you load the module.
Only the branch I mentioned (https://github.com/kira-lin/dgl/tree/tvm_integration) contains the complete code that uses the FeatGraph backend. Regarding the question in #14, yes, GAT is also supported (it was mentioned in the paper), and you can use it by compiling the tvm_integration branch.
If you are interested in native sparse support in TVM, our work on that is coming soon; please stay tuned.
Hi, thank you for the kind response. For the branch https://github.com/kira-lin/dgl/tree/tvm_integration, if I want to use the FeatGraph backend, what specific Python code do I need to write? For example, if I only call dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so"), will the FeatGraph backend be used automatically? If not, which Python code do I need so that I can use the FeatGraph GCN and GAT backends?
The README file in https://github.com/kira-lin/dgl/tree/tvm_integration/featgraph only shows how to run test.py to verify correctness. However, test.py only contains a single test case, dgl.sparse._CAPI_FG_SDDMMTreeReduction(gidx, u, v, e), for the SDDMM kernels. It is a little hard for me to figure out how to run the other FeatGraph kernel backends. Could you provide more detailed instructions about which Python code I need to write so that I can use the FeatGraph GCN and GAT backend kernels? Thank you.
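For context, my current understanding is that this single test case can be invoked roughly as follows (a sketch only; the graph construction, tensor shapes, and the zerocopy_to_dgl_ndarray conversion are my own assumptions, not taken from test.py):

import dgl
import torch as th
from dgl import backend as dgl_F

# Load the compiled FeatGraph kernels (path relative to the featgraph/ directory).
dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so")

# Hypothetical inputs: a random GPU graph plus multi-head node features.
g = dgl.rand_graph(100, 500).to('cuda')
gidx = g._graph                                 # underlying HeteroGraphIndex, as in test.py
u = th.rand(100, 2, 8, device='cuda')           # source-node features (2 heads, 8 dims)
v = th.rand(100, 2, 8, device='cuda')           # destination-node features
e = th.zeros(500, 2, 1, device='cuda')          # per-edge output buffer

dgl.sparse._CAPI_FG_SDDMMTreeReduction(
    gidx,
    dgl_F.zerocopy_to_dgl_ndarray(u),
    dgl_F.zerocopy_to_dgl_ndarray(v),
    dgl_F.zerocopy_to_dgl_ndarray(e))
print(e)                                        # e should now hold the per-edge results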
These are the steps we followed:
(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/featgraph$ git branch
master
* tvm_integration
(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/build$ pwd
/home/ygong07/dgl_src/dgl_tvm/dgl/build
(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/build$ cmake -DUSE_CUDA=ON -DUSE_TVM=ON ..
-- Start configuring project dgl
-- Build with CUDA support
-- Found CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.2
-- Found CUDA_CUDART_LIBRARY=/usr/local/cuda-11.2/lib64/libcudart.so
-- Found CUDA_CUBLAS_LIBRARY=/usr/lib/x86_64-linux-gnu/libcublas.so
-- Found OpenMP_C: -fopenmp
-- Found OpenMP_CXX: -fopenmp
-- -fopenmp -O2 -Wall -fPIC -std=c++11 -DUSE_AVX -DIDXTYPEWIDTH=64 -DREALTYPEWIDTH=32
-- Running GPU architecture autodetection
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
-- Found CUDA arch 8.0
-- CUDA flags: -Xcompiler ,-fopenmp,-O2,-Wall,-fPIC,,,-DUSE_AVX,-DIDXTYPEWIDTH=64,-DREALTYPEWIDTH=32;-gencode;arch=compute_80,code=sm_80;--expt-extended-lambda;-Wno-deprecated-declarations;-std=c++14
-- Found OpenMP_C: -fopenmp
-- Found OpenMP_CXX: -fopenmp
-- /home/ygong07/dgl_src/dgl_tvm/dgl/third_party/dmlc-core/cmake/build_config.h.in -> include/dmlc/build_config.h
-- Start configuring project featgraph
-- Found CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.2
-- Found CUDA_CUDART_LIBRARY=/usr/local/cuda-11.2/lib64/libcudart.so
-- Found CUDA_CUBLAS_LIBRARY=/usr/lib/x86_64-linux-gnu/libcublas.so
-- /usr/local/cuda-11.2/include
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ygong07/dgl_src/dgl_tvm/dgl/build
(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/build$ make -j4
[ 1%] Creating featgraph kernels...
[ 6%] Built target dmlc
[ 34%] Built target metis
/home/ygong07/tvm/python/tvm/driver/build_module.py:242: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
warnings.warn(
[ 34%] Built target featgraph_kernel
[ 35%] Built target featgraph_runtime
[ 35%] Linking CXX shared library libdgl.so
[100%] Built target dgl
(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/featgraph$ python3 test.py
Using backend: pytorch
tensor([[[1.5832],
[1.8842]],
[[1.1876],
[2.5858]],
[[1.5149],
[0.9924]],
...
[[2.2963],
[1.3279]],
[[1.7643],
[1.2339]],
[[2.3274],
[1.7878]]], device='cuda:0')
[[[1.5831739]
[1.8842214]]
[[1.1875974]
[2.5857563]]
[[1.5148897]
[0.9924001]]
....
[[2.2962904]
[1.3278971]]
[[1.7643319]
[1.233911 ]]
[[2.3274217]
[1.7877729]]]
- We ran the GCN and GAT scripts using
dgl.sparse._CAPI_FG_LoadModule("/home/ygong07/dgl_src/dgl_tvm/dgl/build/featgraph/libfeatgraph_kernels.so")
- The training times are the same as the DGL training times.
- Please let us know if you see any issues as these numbers will be reported in a research paper.
Thank you very much for your help.
Oh sorry, what I meant is the tvm-kernel branch.
Hi, the tvm-kernel branch you mentioned does not include the 'featgraph' folder, so I am not sure how to compile it specifically for FeatGraph or how to verify whether FeatGraph is installed correctly. Could you provide me with more instructions? Thank you.
The tvm-kernel branch is fully Python based, and the FeatGraph kernels are triggered when you set the environment variable DGLENGINE to tvm.
See https://github.com/kira-lin/dgl/blob/tvm-kernel/python/dgl/sparse.py#L13-L16
Btw, I don't think you should expect a speedup from FeatGraph over DGL 0.8, because most of the optimized kernels have already been merged into DGL.
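Concretely, you can set the variable in Python before importing dgl (sparse.py reads it at module import time, per the lines quoted below), or export DGLENGINE=tvm in the shell before launching your training script. A minimal sketch:

import os
os.environ['DGLENGINE'] = 'tvm'   # must be set before `import dgl`; sparse.py checks it at import time
import dgl                        # gspmm/gsddmm will now dispatch to the TVM/FeatGraph kernels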
13 use_tvm = True if 'DGLENGINE' in os.environ and os.getenv('DGLENGINE') == 'tvm' else False
14 if use_tvm:
15 import tvm
16 from .tvm import gsddmm, gspmm
Based on line 13, we made sure use_tvm is True; unfortunately, it crashes. When use_tvm is False it does run, but I suspect it is calling the DGL kernels.
We are still interested in running FeatGraph end-to-end. Do let us know if there are any other instructions.
Would you mind sharing the error message so that we can debug why it crashes?
Here is the error I got:
(base) ygong07@mira0:~/compare_graphPy/GraphPy_GPU/build$ python3 GCN_pubmed_dgl.py
Using backend: pytorch
use_tvm True
Output of Read function is
/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/base.py:45: DGLWarning: Recommend creating graphs by `dgl.graph(data)` instead of `dgl.DGLGraph(data)`.
return warnings.warn(message, category=category, stacklevel=1)
graph creation time is: 0:00:00.029156
Traceback (most recent call last):
File "GCN_pubmed_dgl.py", line 244, in <module>
logits = net(graph, feature)
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "GCN_pubmed_dgl.py", line 193, in forward
h = self.conv1(g, inputs)
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/nn/pytorch/conv/graphconv.py", line 269, in forward
graph.update_all(fn.copy_src(src='h', out='m'),
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/heterograph.py", line 4499, in update_all
ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/core.py", line 283, in message_passing
ndata = invoke_gspmm(g, mfunc, rfunc)
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/core.py", line 255, in invoke_gspmm
z = op(graph, x)
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/ops/spmm.py", line 171, in func
return gspmm(g, 'copy_lhs', reduce_op, x, None)
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/ops/spmm.py", line 62, in gspmm
ret = gspmm_internal(g._graph, op,
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 235, in gspmm
return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 64, in forward
out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/sparse.py", line 87, in _gspmm
return _gspmm_tvm(gidx, op, reduce_op, u, e) if use_tvm \
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/sparse.py", line 373, in _gspmm_tvm
mod = gspmm.spmm(
File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/tvm/gspmm.py", line 301, in spmm
if topi.util.get_const_int(topi.util.prod(out.shape[1:])) < 16:
AttributeError: module 'tvm.topi' has no attribute 'util'
This is due to the TVM version: tvm.topi.util was renamed in later TVM releases, so you should use TVM 0.7.
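A quick way to check which TVM installation is being picked up (a small sketch; 0.7 is the version this branch expects):

import tvm
print(tvm.__version__)        # should report 0.7.x for this branch
from tvm import topi
print(hasattr(topi, 'util'))  # True on TVM 0.7; later releases renamed this module to topi.utils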