dgl EGNNConv Leading to NaN Gradients for Identical Input Coordinates

🐛 Bug

The dgl.nn.pytorch.EGNNConv module appears to produce NaN gradients when all input node coordinates are identical.

When I pass in a position tensor filled with zeros and apply the same EGNNConv layer three times in a row, the gradients become NaN. After reading the EGNNConv code, I suspect the cause lies in how the positions interact with the input features inside the MLP. Is this a problem with the EGNN formulation itself, or a flaw in the implementation?
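One plausible mechanism, sketched under an assumption that is not confirmed in this thread: if the layer normalizes each coordinate difference by its Euclidean length (the square root of the squared distance), then the backward pass through the square root at a distance of exactly zero multiplies an infinite derivative by zero, which gives NaN. The plain-Python snippet below hand-applies the chain rule the way reverse-mode autodiff would. `norm_grad` is a hypothetical helper written for illustration; it is not DGL code.

```python
import math

def norm_grad(diff):
    """Gradient of sqrt(sum(d*d)) with respect to each d, applied step by
    step as reverse-mode autodiff would (illustrative helper, not DGL code)."""
    radial = sum(d * d for d in diff)  # squared distance
    # The derivative of sqrt(r) is 1 / (2 * sqrt(r)), which is infinite at r == 0.
    d_sqrt = math.inf if radial == 0.0 else 0.5 / math.sqrt(radial)
    # The derivative of d*d is 2 * d; at d == 0 this is 0, and inf * 0 is NaN.
    return [d_sqrt * 2.0 * d for d in diff]

print(norm_grad([3.0, 4.0, 0.0]))  # distinct points: finite gradient
print(norm_grad([0.0, 0.0, 0.0]))  # identical points: every entry is NaN
```

If this is indeed the cause, adding a small epsilon inside the square root is a common way to keep the derivative finite. Whether that fits EGNNConv has not been tested here.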

To Reproduce

Steps to reproduce the behavior:

import dgl
import torch
from dgl.nn.pytorch import EGNNConv
from torch.optim import Adam

# 10 nodes; only nodes 0-3 are connected, by 6 directed edges.
graph = dgl.graph(([0, 1, 2, 1, 2, 3], [1, 2, 3, 0, 1, 2]), num_nodes=10).to("cuda")
model = EGNNConv(64, 64, 64, 64).cuda()
optimizer = Adam(model.parameters(), lr=1e-4)

for i in range(10000):
    optimizer.zero_grad()
    # Every node shares the same coordinates (the origin).
    position = torch.zeros([10, 3]).float().cuda()
    node_features = torch.randn([10, 64]).float().cuda()
    edge_features = torch.randn([6, 64]).float().cuda()
    # Apply the same layer three times.
    node_features, position = model(graph, node_features, position, edge_features)
    node_features, position = model(graph, node_features, position, edge_features)
    node_features, position = model(graph, node_features, position, edge_features)
    loss = (node_features**2).mean()
    loss.backward()
    optimizer.step()
    print(loss.item())

Expected behavior

Environment

DGL Version (e.g., 1.0): 0.9.1.post1
Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.12.1+cu113
OS (e.g., Linux): Ubuntu 18.04
How you installed DGL (conda, pip, source): pip
Build command you used (if compiling from source):
Python version: 3.9.18
CUDA/cuDNN version (if applicable): CUDA 11.3
GPU models and configuration (e.g. V100): RTX 4090
Any other relevant information:

Additional context

Apr 04 '24 14:04 Shadow-Dream

Does it still occur in the latest DGL version?

Apr 11 '24 02:04 BarclayII

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.

May 12 '24 01:05 github-actions[bot]

Hi, I am closing this issue assuming you are happy about our response. Feel free to follow up and reopen the issue if you have more questions with regard to our response.

May 23 '24 01:05 frozenbugs
