[Bug] Using LIBXSMM gives a wrong SpMM output for certain cases
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- Build the latest master (or any version after 0.7) using
cmake -DUSE_CUDA=ON -DBUILD_TORCH=ON -DUSE_LIBXSMM=ON ..
and run the following code.
import torch
import dgl
import dgl.function as fn
from ogb.nodeproppred import DglNodePropPredDataset
data = DglNodePropPredDataset(name='ogbn-papers100M', root='./')
g, labels = data[0]
g = dgl.to_bidirected(g)
g.ndata['temp'] = torch.ones(g.num_nodes())
g.update_all(fn.copy_u('temp', 'm'), fn.sum('m', 'temp'))
- The resulting output of g.ndata['temp'] is all zeros.
>>> g.ndata['temp']
tensor([0., 0., 0., ..., 0., 0., 0.])
Expected behavior
- Build the latest master (or any version after 0.7) using
cmake -DUSE_CUDA=ON -DBUILD_TORCH=ON -DUSE_LIBXSMM=OFF ..
and run the same code.
import torch
import dgl
import dgl.function as fn
from ogb.nodeproppred import DglNodePropPredDataset
data = DglNodePropPredDataset(name='ogbn-papers100M', root='./')
g, labels = data[0]
g = dgl.to_bidirected(g)
g.ndata['temp'] = torch.ones(g.num_nodes())
g.update_all(fn.copy_u('temp', 'm'), fn.sum('m', 'temp'))
- The resulting output of g.ndata['temp'] is now correct.
>>> g.ndata['temp']
tensor([ 1., 8., 7., ..., 15., 1., 2.])
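For reference, the copy_u + sum reduce above is an SpMM that, with all-ones node features, computes each node's in-degree. A minimal NumPy sketch of the same computation on a toy graph (independent of DGL and the dataset; the toy edge list is made up for illustration) shows what the output should look like:

```python
import numpy as np

# Toy bidirected graph with 4 nodes; edges stored as (src, dst) pairs.
src = np.array([0, 1, 1, 2, 2, 3])
dst = np.array([1, 0, 2, 1, 3, 2])

feat = np.ones(4)  # analogous to g.ndata['temp'] = torch.ones(...)
out = np.zeros(4)

# copy_u + sum: every edge carries feat[src] to dst, summed per node.
# np.add.at accumulates duplicates, unlike plain fancy-index assignment.
np.add.at(out, dst, feat[src])

print(out)  # per-node in-degrees: [1. 2. 2. 1.]
```

With all-ones features, the result can never be all zeros on a graph with edges, since each edge contributes exactly 1 to its destination node.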
Environment
- DGL Version (e.g., 1.0): master, cac25f63
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.9
- OS (e.g., Linux): CentOS 8
- How you installed DGL (conda, pip, source): source
- Build command you used (if compiling from source): explained above
- Python version: 3.8
- CUDA/cuDNN version (if applicable): 10.2
- GPU models and configuration (e.g. V100): V100
- Any other relevant information: Intel Xeon Gold 6230
Additional context
Not entirely sure, but the problem seems to occur when the number of edges is very large.
Let me look into this one.