
Use end-to-end DGL scripts to run FeatGraph

Ed-gong opened this issue 2 years ago • 17 comments

Hi, I want to run FeatGraph end-to-end. I have already built DGL (with FeatGraph) and run the test.py file successfully using the instructions posted at https://github.com/dmlc/dgl/tree/master/featgraph.

  • If I want to run end-to-end GCN training on the Pubmed or Reddit dataset, can I just use the DGL GCN benchmark script I already have, without changing any kernel names? In other words, which parts of the DGL Python script do I need to change so that I can run FeatGraph (not DGL) end-to-end? Thank you.

Ed-gong avatar May 23 '22 17:05 Ed-gong

You might check out this branch of DGL:

https://github.com/kira-lin/dgl/tree/tvm_integration

yzh119 avatar May 23 '22 21:05 yzh119

Thanks for your reply. I just clarified my question by re-editing the post above. Can you respond again? Thank you.

Ed-gong avatar May 24 '22 18:05 Ed-gong

I used the DGL test scripts to run GCN on the Pubmed and Cora datasets with one extra line of code: dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so"). The Python script runs without any error, but the training time with FeatGraph is the same as with plain DGL. It seems FeatGraph does not improve training efficiency at all.
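For reference, here is a minimal sketch of what such a script might look like (not the exact script used in this thread): a standard DGL GCN on Pubmed with the single extra FeatGraph line; the .so path is whatever your local build produced.

import torch
import torch.nn.functional as F
import dgl
from dgl.nn import GraphConv
from dgl.data import PubmedGraphDataset

# The one extra line from this thread: load the FeatGraph kernel module.
dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so")

dataset = PubmedGraphDataset()
g = dgl.add_self_loop(dataset[0]).to(torch.device("cuda"))
feat, label = g.ndata["feat"], g.ndata["label"]
train_mask = g.ndata["train_mask"]

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, n_classes):
        super().__init__()
        self.conv1 = GraphConv(in_dim, hid_dim)
        self.conv2 = GraphConv(hid_dim, n_classes)

    def forward(self, g, x):
        h = F.relu(self.conv1(g, x))
        return self.conv2(g, h)

model = GCN(feat.shape[1], 16, dataset.num_classes).to("cuda")
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for epoch in range(50):
    logits = model(g, feat)
    loss = F.cross_entropy(logits[train_mask], label[train_mask])
    opt.zero_grad()
    loss.backward()
    opt.step()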

Ed-gong avatar Jun 01 '22 21:06 Ed-gong

I don't think FeatGraph outperforms cuSPARSE for GCN on GPU (see Table IV in the paper). Since DGL uses cuSPARSE, it is normal that you don't observe any acceleration here.

yzh119 avatar Jun 01 '22 22:06 yzh119

Thank you very much for your response. I am closing this issue.

Ed-gong avatar Jun 02 '22 20:06 Ed-gong

Sorry, I just noticed that you were using dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so") to select FeatGraph as the backend. That integration was actually abandoned because TVM does not have native sparse support and we might hit several issues when using it in production, so in most cases you will still be using DGL's native backend even if you load the module.

Only the branch I mentioned (https://github.com/kira-lin/dgl/tree/tvm_integration) contains the complete code that uses the FeatGraph backend. Regarding the question in #14: yes, GAT is also supported (it was mentioned in the paper), and you can use it by compiling the tvm_integration branch.
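A minimal sketch of one way to check this, assuming a CUDA build of DGL and the kernel .so path from your own build: time the same SpMM that GraphConv relies on before and after loading the module. If loading the module does not change the kernel dispatch, the two timings will be essentially identical.

import time
import torch
import dgl
import dgl.ops as ops

g = dgl.rand_graph(10_000, 200_000).to(torch.device("cuda"))  # arbitrary random graph
x = torch.randn(10_000, 64, device="cuda")                    # arbitrary node features

def bench():
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(100):
        ops.copy_u_sum(g, x)  # the SpMM pattern used by GraphConv message passing
    torch.cuda.synchronize()
    return time.time() - t0

before = bench()
dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so")
after = bench()
print(f"before load: {before:.3f}s, after load: {after:.3f}s")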

yzh119 avatar Jun 07 '22 21:06 yzh119

If you are interested in native sparse support in TVM, our work is coming soon; please stay tuned.

yzh119 avatar Jun 07 '22 21:06 yzh119

Hi, thank you for the kind response. For the branch https://github.com/kira-lin/dgl/tree/tvm_integration, if I want to use the FeatGraph backend, what specific Python code do I need to write? For example, if I only write dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so"), will the FeatGraph backend be used automatically? If not, which Python code do I need so that I can use the FeatGraph GCN and GAT backends?

The README in https://github.com/kira-lin/dgl/tree/tvm_integration/featgraph only shows how to run test.py to verify correctness. However, test.py contains only one test case, dgl.sparse._CAPI_FG_SDDMMTreeReduction(gidx, u, v, e), which exercises the SDDMM kernel. It is hard for me to figure out how to run the other FeatGraph kernel backends. Could you provide more detailed instructions about which Python code I need to write so that I can use the FeatGraph GCN and GAT backend kernels? Thank you.

Ed-gong avatar Jun 10 '22 17:06 Ed-gong

These are the steps we followed:

(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/featgraph$ git branch
  master
* tvm_integration
(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/build$ pwd
/home/ygong07/dgl_src/dgl_tvm/dgl/build
(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/build$ cmake -DUSE_CUDA=ON -DUSE_TVM=ON ..
-- Start configuring project dgl
-- Build with CUDA support
-- Found CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.2
-- Found CUDA_CUDART_LIBRARY=/usr/local/cuda-11.2/lib64/libcudart.so
-- Found CUDA_CUBLAS_LIBRARY=/usr/lib/x86_64-linux-gnu/libcublas.so
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- -fopenmp -O2 -Wall -fPIC -std=c++11  -DUSE_AVX -DIDXTYPEWIDTH=64 -DREALTYPEWIDTH=32
-- Running GPU architecture autodetection
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
-- Found CUDA arch 8.0
-- CUDA flags: -Xcompiler ,-fopenmp,-O2,-Wall,-fPIC,,,-DUSE_AVX,-DIDXTYPEWIDTH=64,-DREALTYPEWIDTH=32;-gencode;arch=compute_80,code=sm_80;--expt-extended-lambda;-Wno-deprecated-declarations;-std=c++14
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- /home/ygong07/dgl_src/dgl_tvm/dgl/third_party/dmlc-core/cmake/build_config.h.in -> include/dmlc/build_config.h
-- Start configuring project featgraph
-- Found CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.2
-- Found CUDA_CUDART_LIBRARY=/usr/local/cuda-11.2/lib64/libcudart.so
-- Found CUDA_CUBLAS_LIBRARY=/usr/lib/x86_64-linux-gnu/libcublas.so
-- /usr/local/cuda-11.2/include
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ygong07/dgl_src/dgl_tvm/dgl/build

(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/build$ make -j4
[  1%] Creating featgraph kernels...
[  6%] Built target dmlc
[ 34%] Built target metis
/home/ygong07/tvm/python/tvm/driver/build_module.py:242: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
  warnings.warn(
[ 34%] Built target featgraph_kernel
[ 35%] Built target featgraph_runtime
[ 35%] Linking CXX shared library libdgl.so
[100%] Built target dgl

(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/featgraph$ python3 test.py 
Using backend: pytorch
tensor([[[1.5832],
         [1.8842]],

        [[1.1876],
         [2.5858]],

        [[1.5149],
         [0.9924]],
         ...
[[2.2963],
         [1.3279]],

        [[1.7643],
         [1.2339]],

        [[2.3274],
         [1.7878]]], device='cuda:0')

[[[1.5831739]
  [1.8842214]]

 [[1.1875974]
  [2.5857563]]

 [[1.5148897]
  [0.9924001]]
....
[[2.2962904]
  [1.3278971]]

 [[1.7643319]
  [1.233911 ]]

 [[2.3274217]
  [1.7877729]]]

  • We ran the GCN and GAT scripts after calling dgl.sparse._CAPI_FG_LoadModule("/home/ygong07/dgl_src/dgl_tvm/dgl/build/featgraph/libfeatgraph_kernels.so")
  • The training times are the same as the DGL training times
  • Please let us know if you see any issues, as these numbers will be reported in a research paper.

Thank you very much for your help.

Ed-gong avatar Jun 13 '22 13:06 Ed-gong

Oh sorry, what I meant is the tvm-kernel branch.

yzh119 avatar Jun 20 '22 06:06 yzh119

Hi, the tvm-kernel branch you mentioned does not include the featgraph folder. Therefore, I am not sure how to compile it specifically for FeatGraph, or how to verify whether FeatGraph is installed correctly. Could you provide me with more instructions? Thank you.

Ed-gong avatar Jun 23 '22 15:06 Ed-gong

The tvm-kernel branch is fully Python based, and the FeatGraph kernels are triggered when you set the environment variable DGLENGINE to tvm.

See https://github.com/kira-lin/dgl/blob/tvm-kernel/python/dgl/sparse.py#L13-L16
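For example, a minimal sketch based on those lines; the key point is that DGLENGINE must be the string 'tvm' (not 'true') and must be set before dgl is imported:

import os
os.environ["DGLENGINE"] = "tvm"  # must be set before `import dgl`

import dgl  # dgl.sparse now imports the TVM-based gsddmm/gspmm kernels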

yzh119 avatar Jun 27 '22 22:06 yzh119

By the way, I don't think you should expect a speedup from FeatGraph over DGL 0.8, because most of the optimized kernels have already been merged into DGL.

yzh119 avatar Jun 27 '22 22:06 yzh119

13 use_tvm = True if 'DGLENGINE' in os.environ and os.getenv('DGLENGINE') == 'tvm' else False
14 if use_tvm:
15     import tvm
16     from .tvm import gsddmm, gspmm

Based on line 13, we made sure use_tvm is True; unfortunately, it crashes. When use_tvm is False, it does run, but I suspect it is then calling the DGL kernels.

We are still interested in running FeatGraph end-to-end. Do let us know if there are any other instructions.

Ed-gong avatar Jul 07 '22 18:07 Ed-gong

Would you mind sharing the error message so that we can debug the crash?

yzh119 avatar Jul 10 '22 04:07 yzh119

Here is the error I got:


(base) ygong07@mira0:~/compare_graphPy/GraphPy_GPU/build$ python3 GCN_pubmed_dgl.py
Using backend: pytorch
use_tvm True
Output of Read function is 
/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/base.py:45: DGLWarning: Recommend creating graphs by `dgl.graph(data)` instead of `dgl.DGLGraph(data)`.
  return warnings.warn(message, category=category, stacklevel=1)
graph creation time is: 0:00:00.029156
Traceback (most recent call last):
  File "GCN_pubmed_dgl.py", line 244, in <module>
    logits = net(graph, feature)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "GCN_pubmed_dgl.py", line 193, in forward
    h = self.conv1(g, inputs)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/nn/pytorch/conv/graphconv.py", line 269, in forward
    graph.update_all(fn.copy_src(src='h', out='m'),
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/heterograph.py", line 4499, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/core.py", line 283, in message_passing
    ndata = invoke_gspmm(g, mfunc, rfunc)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/core.py", line 255, in invoke_gspmm
    z = op(graph, x)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/ops/spmm.py", line 171, in func
    return gspmm(g, 'copy_lhs', reduce_op, x, None)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/ops/spmm.py", line 62, in gspmm
    ret = gspmm_internal(g._graph, op,
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 235, in gspmm
    return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 64, in forward
    out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/sparse.py", line 87, in _gspmm
    return _gspmm_tvm(gidx, op, reduce_op, u, e) if use_tvm \
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/sparse.py", line 373, in _gspmm_tvm
    mod = gspmm.spmm(
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/tvm/gspmm.py", line 301, in spmm
    if topi.util.get_const_int(topi.util.prod(out.shape[1:])) < 16:
AttributeError: module 'tvm.topi' has no attribute 'util'

Ed-gong avatar Jul 23 '22 18:07 Ed-gong

This is due to the TVM version; you should use TVM 0.7.
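For context, I believe later TVM releases renamed topi.util to topi.utils, which is exactly what triggers the AttributeError in the traceback above. A quick way to check which version is installed:

import tvm
print(tvm.__version__)  # the tvm-kernel branch expects 0.7.x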

yzh119 avatar Jul 24 '22 00:07 yzh119