[Bug] Using LIBXSMM gives a wrong SpMM output for certain cases
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- Build the latest master (or any version after 0.7) using
cmake -DUSE_CUDA=ON -DBUILD_TORCH=ON -DUSE_LIBXSMM=ON ..
and run the following code.
import torch
import dgl
import dgl.function as fn
from ogb.nodeproppred import DglNodePropPredDataset
data = DglNodePropPredDataset(name='ogbn-papers100M', root='./')
g, labels = data[0]
g = dgl.to_bidirected(g)
g.ndata['temp'] = torch.ones(g.num_nodes())
g.update_all(fn.copy_u('temp', 'm'), fn.sum('m', 'temp'))
- The resulting output of g.ndata['temp'] is all zeros.
>>> g.ndata['temp']
tensor([0., 0., 0., ..., 0., 0., 0.])
Expected behavior
- Build the latest master (or any version after 0.7) using
cmake -DUSE_CUDA=ON -DBUILD_TORCH=ON -DUSE_LIBXSMM=OFF ..
and run the same code.
import torch
import dgl
import dgl.function as fn
from ogb.nodeproppred import DglNodePropPredDataset
data = DglNodePropPredDataset(name='ogbn-papers100M', root='./')
g, labels = data[0]
g = dgl.to_bidirected(g)
g.ndata['temp'] = torch.ones(g.num_nodes())
g.update_all(fn.copy_u('temp', 'm'), fn.sum('m', 'temp'))
- The resulting output of g.ndata['temp'] is now correct.
>>> g.ndata['temp']
tensor([ 1., 8., 7., ..., 15., 1., 2.])
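For reference, the copy_u + sum reduce above is an SpMM that, with all-ones node features, computes each node's in-degree. A minimal NumPy sketch of the same computation on a toy graph (independent of DGL and the dataset; the toy edge list is made up for illustration) shows what the output should look like:

```python
import numpy as np

# Toy bidirected graph with 4 nodes; edges stored as (src, dst) pairs.
src = np.array([0, 1, 1, 2, 2, 3])
dst = np.array([1, 0, 2, 1, 3, 2])

feat = np.ones(4)  # analogous to g.ndata['temp'] = torch.ones(...)
out = np.zeros(4)

# copy_u + sum: every edge carries feat[src] to dst, summed per node.
# np.add.at accumulates duplicates, unlike plain fancy-index assignment.
np.add.at(out, dst, feat[src])

print(out)  # per-node in-degrees: [1. 2. 2. 1.]
```

With all-ones features, the result can never be all zeros on a graph with edges, since each edge contributes exactly 1 to its destination node.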
Environment
- DGL Version (e.g., 1.0): master, cac25f63
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.9
- OS (e.g., Linux): CentOS 8
- How you installed DGL (conda, pip, source): source
- Build command you used (if compiling from source): explained above
- Python version: 3.8
- CUDA/cuDNN version (if applicable): 10.2
- GPU models and configuration (e.g. V100): V100
- Any other relevant information: Intel Xeon Gold 6230
Additional context
Not entirely sure, but the problem seems to occur when the number of edges is very large.
Let me look into this one.