
EGNNConv Leading to NaN Gradients for Identical Input Coordinates

Open Shadow-Dream opened this issue 10 months ago • 2 comments

🐛 Bug

I would like to report a potential issue with the dgl.nn.pytorch.EGNNConv module: it seems to produce NaN gradients when the input coordinates are identical across nodes.

While using dgl.nn.pytorch.EGNNConv, I noticed that feeding in a position tensor filled with zeros and applying the same EGNNConv layer three times in sequence results in NaN gradients. After reviewing the code for EGNNConv, I suspect the issue is related to how the positions interact with the input features in the MLP. Could this be a problem with the EGNN formula itself, or does it stem from an implementation flaw?
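One plausible mechanism can be shown in plain PyTorch, without DGL. The sketch below assumes that the layer derives an edge length as the square root of a summed squared coordinate difference; that is an assumption about the implementation, not something confirmed in this thread. The helper `radial_norm` and its `eps` parameter are hypothetical names used only for illustration. When two nodes coincide, the derivative of the square root at zero is infinite, and the chain rule multiplies it by a zero coordinate difference, which gives NaN.

```python
import torch


def radial_norm(x_diff: torch.Tensor, eps: float = 0.0) -> torch.Tensor:
    """Hypothetical helper: edge length from coordinate differences.

    Computes sqrt(sum(x_diff ** 2) + eps) per edge. With eps == 0 this is
    the naive form whose gradient is undefined at x_diff == 0.
    """
    radial = x_diff.square().sum(dim=-1, keepdim=True)
    return (radial + eps).sqrt()


# Four edges between coincident nodes: every coordinate difference is zero.
x_diff = torch.zeros(4, 3, requires_grad=True)
radial_norm(x_diff).sum().backward()
naive_grad = x_diff.grad.clone()

# Same input, but with a small constant added inside the square root
# (an illustrative workaround, not DGL's actual fix).
x_diff_safe = torch.zeros(4, 3, requires_grad=True)
radial_norm(x_diff_safe, eps=1e-8).sum().backward()
safe_grad = x_diff_safe.grad.clone()

print("naive gradient is NaN:", torch.isnan(naive_grad).all().item())
print("stabilized gradient is finite:", torch.isfinite(safe_grad).all().item())
```

If this is indeed the cause, it might also explain why a single pass is unaffected: the initial zero positions carry no gradient, whereas the positions returned by an earlier layer do, so the undefined derivative would only enter the backward pass once layers are chained.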

To Reproduce

Steps to reproduce the behavior:

import dgl
import torch
from dgl.nn.pytorch import EGNNConv
from torch.optim import Adam

# 10 nodes, 6 directed edges; nodes 4-9 are isolated.
graph = dgl.graph(([0,1,2,1,2,3],[1,2,3,0,1,2]), num_nodes=10).to("cuda")
model = EGNNConv(64, 64, 64, 64).cuda()
optimizer = Adam(model.parameters(), lr = 1e-4)

for i in range(10000):
    optimizer.zero_grad()
    # All nodes share the same coordinates (the origin).
    position = torch.zeros([10, 3]).float().cuda()
    node_features = torch.randn([10, 64]).float().cuda()
    edge_features = torch.randn([6, 64]).float().cuda()
    # Apply the same layer three times in sequence.
    node_features, position = model(graph, node_features, position, edge_features)
    node_features, position = model(graph, node_features, position, edge_features)
    node_features, position = model(graph, node_features, position, edge_features)
    loss = (node_features**2).mean()
    loss.backward()
    optimizer.step()
    print(loss.item())

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 0.9.1.post1
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 1.12.1+cu113
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.9.18
  • CUDA/cuDNN version (if applicable): CUDA 11.3
  • GPU models and configuration (e.g. V100): RTX 4090
  • Any other relevant information:

Additional context

Shadow-Dream, Apr 04 '24 14:04

Does it still occur in the latest DGL version?

BarclayII, Apr 11 '24 02:04

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot], May 12 '24 01:05

Hi, I am closing this issue assuming you are happy about our response. Feel free to follow up and reopen the issue if you have more questions with regard to our response.

frozenbugs, May 23 '24 01:05