
Use end-to-end DGL scripts to run FeatGraph

Ed-gong opened this issue 2 years ago • 17 comments

Hi, I want to run FeatGraph end-to-end. I have already built DGL (with FeatGraph) and run the test.py file successfully using the instructions posted at https://github.com/dmlc/dgl/tree/master/featgraph.

  • If I want to run end-to-end GCN training on the Pubmed or Reddit dataset, can I just use the DGL GCN benchmark script I already have, without changing any kernel names? In other words, which parts of the DGL Python script do I need to change so that I can run FeatGraph (not DGL) end-to-end? Thank you.

Ed-gong avatar May 23 '22 17:05 Ed-gong

You might check out this branch of DGL:

https://github.com/kira-lin/dgl/tree/tvm_integration

yzh119 avatar May 23 '22 21:05 yzh119

Thanks for your reply. I just clarified my question by re-editing the post above. Can you respond again? Thank you.

Ed-gong avatar May 24 '22 18:05 Ed-gong

I used the DGL test scripts to run GCN on the Pubmed and Cora datasets with one extra line of code: dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so"). The Python script runs without any error, but the training time with FeatGraph is the same as with plain DGL. It seems FeatGraph does not improve training efficiency at all.
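For reference, here is a minimal sketch of what such a script might look like (not the exact script used in this thread): a standard DGL GCN on Pubmed with the single extra FeatGraph line; the .so path is whatever your local build produced.

import torch
import torch.nn.functional as F
import dgl
from dgl.nn import GraphConv
from dgl.data import PubmedGraphDataset

# The one extra line from this thread: load the FeatGraph kernel module.
dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so")

dataset = PubmedGraphDataset()
g = dgl.add_self_loop(dataset[0]).to(torch.device("cuda"))
feat, label = g.ndata["feat"], g.ndata["label"]
train_mask = g.ndata["train_mask"]

class GCN(torch.nn.Module):
    def __init__(self, in_dim, hid_dim, n_classes):
        super().__init__()
        self.conv1 = GraphConv(in_dim, hid_dim)
        self.conv2 = GraphConv(hid_dim, n_classes)

    def forward(self, g, x):
        h = F.relu(self.conv1(g, x))
        return self.conv2(g, h)

model = GCN(feat.shape[1], 16, dataset.num_classes).to("cuda")
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for epoch in range(50):
    logits = model(g, feat)
    loss = F.cross_entropy(logits[train_mask], label[train_mask])
    opt.zero_grad()
    loss.backward()
    opt.step()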

Ed-gong avatar Jun 01 '22 21:06 Ed-gong

I don't think FeatGraph outperforms cuSPARSE for GCN on GPU (see Table IV in the paper). Since DGL uses cuSPARSE, it is normal that you don't observe any acceleration here.

yzh119 avatar Jun 01 '22 22:06 yzh119

Thank you very much for your response. I am closing this issue.

Ed-gong avatar Jun 02 '22 20:06 Ed-gong

Sorry, I just noticed that you were using dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so") to select FeatGraph as the backend. That integration was actually abandoned because TVM does not have native sparse support and we might hit several issues when using it in production, so in most cases you will still be using DGL's native backend even if you load the module.

Only the branch I mentioned (https://github.com/kira-lin/dgl/tree/tvm_integration) contains the complete code that uses the FeatGraph backend. Regarding the question in #14: yes, GAT is also supported (it was mentioned in the paper), and you can use it by compiling the tvm_integration branch.
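A minimal sketch of one way to check this, assuming a CUDA build of DGL and the kernel .so path from your own build: time the same SpMM that GraphConv relies on before and after loading the module. If loading the module does not change the kernel dispatch, the two timings will be essentially identical.

import time
import torch
import dgl
import dgl.ops as ops

g = dgl.rand_graph(10_000, 200_000).to(torch.device("cuda"))  # arbitrary random graph
x = torch.randn(10_000, 64, device="cuda")                    # arbitrary node features

def bench():
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(100):
        ops.copy_u_sum(g, x)  # the SpMM pattern used by GraphConv message passing
    torch.cuda.synchronize()
    return time.time() - t0

before = bench()
dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so")
after = bench()
print(f"before load: {before:.3f}s, after load: {after:.3f}s")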

yzh119 avatar Jun 07 '22 21:06 yzh119

If you are interested in native sparse support in TVM, our work is coming soon; please stay tuned.

yzh119 avatar Jun 07 '22 21:06 yzh119

Hi, thank you for the kind response. For the branch https://github.com/kira-lin/dgl/tree/tvm_integration, if I want to use the FeatGraph backend, what specific Python code do I need to write? For example, if I only write dgl.sparse._CAPI_FG_LoadModule("../build/featgraph/libfeatgraph_kernels.so"), will the FeatGraph backend be used automatically? If not, which Python code do I need so that I can use the FeatGraph GCN and GAT backends?

The README in https://github.com/kira-lin/dgl/tree/tvm_integration/featgraph only shows how to run test.py to verify correctness. However, test.py contains only one test case, dgl.sparse._CAPI_FG_SDDMMTreeReduction(gidx, u, v, e), which exercises the SDDMM kernel. It is hard for me to figure out how to run the other FeatGraph kernel backends. Could you provide more detailed instructions about which Python code I need to write so that I can use the FeatGraph GCN and GAT backend kernels? Thank you.

Ed-gong avatar Jun 10 '22 17:06 Ed-gong

These are the steps we followed:

(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/featgraph$ git branch
  master
* tvm_integration
(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/build$ pwd
/home/ygong07/dgl_src/dgl_tvm/dgl/build
(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/build$ cmake -DUSE_CUDA=ON -DUSE_TVM=ON ..
-- Start configuring project dgl
-- Build with CUDA support
-- Found CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.2
-- Found CUDA_CUDART_LIBRARY=/usr/local/cuda-11.2/lib64/libcudart.so
-- Found CUDA_CUBLAS_LIBRARY=/usr/lib/x86_64-linux-gnu/libcublas.so
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- -fopenmp -O2 -Wall -fPIC -std=c++11  -DUSE_AVX -DIDXTYPEWIDTH=64 -DREALTYPEWIDTH=32
-- Running GPU architecture autodetection
nvcc warning : The 'compute_35', 'compute_37', 'compute_50', 'sm_35', 'sm_37' and 'sm_50' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).
-- Found CUDA arch 8.0
-- CUDA flags: -Xcompiler ,-fopenmp,-O2,-Wall,-fPIC,,,-DUSE_AVX,-DIDXTYPEWIDTH=64,-DREALTYPEWIDTH=32;-gencode;arch=compute_80,code=sm_80;--expt-extended-lambda;-Wno-deprecated-declarations;-std=c++14
-- Found OpenMP_C: -fopenmp  
-- Found OpenMP_CXX: -fopenmp  
-- /home/ygong07/dgl_src/dgl_tvm/dgl/third_party/dmlc-core/cmake/build_config.h.in -> include/dmlc/build_config.h
-- Start configuring project featgraph
-- Found CUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-11.2
-- Found CUDA_CUDART_LIBRARY=/usr/local/cuda-11.2/lib64/libcudart.so
-- Found CUDA_CUBLAS_LIBRARY=/usr/lib/x86_64-linux-gnu/libcublas.so
-- /usr/local/cuda-11.2/include
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ygong07/dgl_src/dgl_tvm/dgl/build

(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/build$ make -j4
[  1%] Creating featgraph kernels...
[  6%] Built target dmlc
[ 34%] Built target metis
/home/ygong07/tvm/python/tvm/driver/build_module.py:242: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.
  warnings.warn(
[ 34%] Built target featgraph_kernel
[ 35%] Built target featgraph_runtime
[ 35%] Linking CXX shared library libdgl.so
[100%] Built target dgl

(base) ygong07@mira0:~/dgl_src/dgl_tvm/dgl/featgraph$ python3 test.py 
Using backend: pytorch
tensor([[[1.5832],
         [1.8842]],

        [[1.1876],
         [2.5858]],

        [[1.5149],
         [0.9924]],
         ...
[[2.2963],
         [1.3279]],

        [[1.7643],
         [1.2339]],

        [[2.3274],
         [1.7878]]], device='cuda:0')

[[[1.5831739]
  [1.8842214]]

 [[1.1875974]
  [2.5857563]]

 [[1.5148897]
  [0.9924001]]
....
[[2.2962904]
  [1.3278971]]

 [[1.7643319]
  [1.233911 ]]

 [[2.3274217]
  [1.7877729]]]

  • We ran the GCN and GAT scripts after calling dgl.sparse._CAPI_FG_LoadModule("/home/ygong07/dgl_src/dgl_tvm/dgl/build/featgraph/libfeatgraph_kernels.so")
  • The training times are the same as the DGL training times
  • Please let us know if you see any issues, as these numbers will be reported in a research paper.

Thank you very much for your help.

Ed-gong avatar Jun 13 '22 13:06 Ed-gong

Oh sorry, what I meant is the tvm-kernel branch.

yzh119 avatar Jun 20 '22 06:06 yzh119

Hi, the tvm-kernel branch you mentioned does not include the featgraph folder. Therefore, I am not sure how to compile it specifically for FeatGraph, or how to verify whether FeatGraph is installed correctly. Could you provide me with more instructions? Thank you.

Ed-gong avatar Jun 23 '22 15:06 Ed-gong

The tvm-kernel branch is fully Python based, and the FeatGraph kernels are triggered when you set the environment variable DGLENGINE to tvm.

See https://github.com/kira-lin/dgl/blob/tvm-kernel/python/dgl/sparse.py#L13-L16
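For example, a minimal sketch based on those lines; the key point is that DGLENGINE must be the string 'tvm' (not 'true') and must be set before dgl is imported:

import os
os.environ["DGLENGINE"] = "tvm"  # must be set before `import dgl`

import dgl  # dgl.sparse now imports the TVM-based gsddmm/gspmm kernels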

yzh119 avatar Jun 27 '22 22:06 yzh119

By the way, I don't think you should expect a speedup from FeatGraph over DGL 0.8, because most of the optimized kernels have already been merged into DGL.

yzh119 avatar Jun 27 '22 22:06 yzh119

13 use_tvm = True if 'DGLENGINE' in os.environ and os.getenv('DGLENGINE') == 'tvm' else False
14 if use_tvm:
15     import tvm
16     from .tvm import gsddmm, gspmm

Based on line 13, we made sure use_tvm is True; unfortunately, it crashes. When use_tvm is False, it does run, but I suspect it is then calling the DGL kernels.

We are still interested in running FeatGraph end-to-end. Do let us know if there are any other instructions.

Ed-gong avatar Jul 07 '22 18:07 Ed-gong

Would you mind sharing the error message so that we can debug the crash?

yzh119 avatar Jul 10 '22 04:07 yzh119

Here is the error I got:


(base) ygong07@mira0:~/compare_graphPy/GraphPy_GPU/build$ python3 GCN_pubmed_dgl.py
Using backend: pytorch
use_tvm True
Output of Read function is 
/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/base.py:45: DGLWarning: Recommend creating graphs by `dgl.graph(data)` instead of `dgl.DGLGraph(data)`.
  return warnings.warn(message, category=category, stacklevel=1)
graph creation time is: 0:00:00.029156
Traceback (most recent call last):
  File "GCN_pubmed_dgl.py", line 244, in <module>
    logits = net(graph, feature)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "GCN_pubmed_dgl.py", line 193, in forward
    h = self.conv1(g, inputs)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/nn/pytorch/conv/graphconv.py", line 269, in forward
    graph.update_all(fn.copy_src(src='h', out='m'),
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/heterograph.py", line 4499, in update_all
    ndata = core.message_passing(g, message_func, reduce_func, apply_node_func)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/core.py", line 283, in message_passing
    ndata = invoke_gspmm(g, mfunc, rfunc)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/core.py", line 255, in invoke_gspmm
    z = op(graph, x)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/ops/spmm.py", line 171, in func
    return gspmm(g, 'copy_lhs', reduce_op, x, None)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/ops/spmm.py", line 62, in gspmm
    ret = gspmm_internal(g._graph, op,
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 235, in gspmm
    return GSpMM.apply(gidx, op, reduce_op, lhs_data, rhs_data)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/backend/pytorch/sparse.py", line 64, in forward
    out, (argX, argY) = _gspmm(gidx, op, reduce_op, X, Y)
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/sparse.py", line 87, in _gspmm
    return _gspmm_tvm(gidx, op, reduce_op, u, e) if use_tvm \
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/sparse.py", line 373, in _gspmm_tvm
    mod = gspmm.spmm(
  File "/home/ygong07/anaconda3/lib/python3.8/site-packages/dgl-0.6-py3.8-linux-x86_64.egg/dgl/tvm/gspmm.py", line 301, in spmm
    if topi.util.get_const_int(topi.util.prod(out.shape[1:])) < 16:
AttributeError: module 'tvm.topi' has no attribute 'util'

Ed-gong avatar Jul 23 '22 18:07 Ed-gong

This is due to the TVM version; you should use TVM 0.7.
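For context, I believe later TVM releases renamed topi.util to topi.utils, which is exactly what triggers the AttributeError in the traceback above. A quick way to check which version is installed:

import tvm
print(tvm.__version__)  # the tvm-kernel branch expects 0.7.x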

yzh119 avatar Jul 24 '22 00:07 yzh119