EGNNConv Leading to NaN Gradients for Identical Input Coordinates
🐛 Bug
I would like to report a potential issue with the dgl.nn.pytorch.EGNNConv module: it seems to produce NaN gradients when the input coordinates of the nodes are identical.
When I feed in a position tensor filled with zeros and apply the same EGNNConv layer three times in a row, the gradients become NaN. After reviewing the EGNNConv code, I suspect the issue is related to how the positions interact with the input features in the MLP. Could this be a problem with the EGNN formula itself, or does it stem from an implementation flaw?
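One possible mechanism can be sketched in plain PyTorch, without DGL. This is an assumption rather than something confirmed from the EGNNConv source: if a layer computes a distance as the square root of a sum of squared coordinate differences and adds a small epsilon outside the square root, then the derivative of the square root at zero is infinite, and multiplying it by the zero derivative of the squared term gives NaN. The helper `norm_grad` and the epsilon values below are hypothetical and only serve the illustration.

```python
import torch


def norm_grad(x, eps, inside_sqrt):
    """Return the autograd gradient of a Euclidean norm of x.

    inside_sqrt=False computes sqrt(sum(x^2)) + eps (epsilon outside the root).
    inside_sqrt=True  computes sqrt(sum(x^2) + eps) (epsilon inside the root).
    """
    x = x.clone().requires_grad_(True)
    sq = (x ** 2).sum()
    r = (sq + eps).sqrt() if inside_sqrt else sq.sqrt() + eps
    r.backward()
    return x.grad


# Identical coordinates give a zero coordinate difference.
diff = torch.zeros(3)

# Epsilon outside the root: the sqrt backward yields inf, and inf * 0 is NaN.
g_outside = norm_grad(diff, eps=1e-30, inside_sqrt=False)

# Epsilon inside the root: the sqrt backward stays finite, so the gradient is 0.
g_inside = norm_grad(diff, eps=1e-8, inside_sqrt=True)

print(g_outside)
print(g_inside)
```

If this is indeed what happens, it would also be consistent with the NaN only appearing once the positions themselves depend on trainable parameters, that is, after the first layer has produced them.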
To Reproduce
Steps to reproduce the behavior:
import dgl
import torch
from dgl.nn.pytorch import EGNNConv
from torch.optim import Adam

graph = dgl.graph(([0, 1, 2, 1, 2, 3], [1, 2, 3, 0, 1, 2]), num_nodes=10).to("cuda")
model = EGNNConv(64, 64, 64, 64).cuda()
optimizer = Adam(model.parameters(), lr=1e-4)

for i in range(10000):
    optimizer.zero_grad()
    # All nodes share the same (zero) coordinates.
    position = torch.zeros([10, 3]).float().cuda()
    node_features = torch.randn([10, 64]).float().cuda()
    edge_features = torch.randn([6, 64]).float().cuda()
    # Apply the same layer three times in a row.
    node_features, position = model(graph, node_features, position, edge_features)
    node_features, position = model(graph, node_features, position, edge_features)
    node_features, position = model(graph, node_features, position, edge_features)
    loss = (node_features**2).mean()
    loss.backward()
    optimizer.step()
    print(loss.item())
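To narrow down where the NaN first appears, a small check can be run right after `loss.backward()`. The following is a minimal sketch that runs on CPU without DGL; the helper name `nonfinite_grads` is hypothetical, and the toy `torch.nn.Linear` layer is only a stand-in that reproduces a NaN gradient through a square root at zero, not the EGNNConv layer itself.

```python
import torch


def nonfinite_grads(module):
    """Return the names of parameters whose gradient contains NaN or Inf."""
    return [
        name
        for name, p in module.named_parameters()
        if p.grad is not None and not torch.isfinite(p.grad).all()
    ]


# Toy stand-in: "positions" produced by a trainable layer from zero input,
# so they are all zero but still depend on the layer's weight.
layer = torch.nn.Linear(3, 3, bias=False)
pos = layer(torch.zeros(1, 3))

# Norm computed as sqrt of a sum of squares; its gradient at zero is NaN.
loss = (pos ** 2).sum().sqrt()
loss.backward()

bad = nonfinite_grads(layer)
print(bad)
```

In the reproduction script, calling `nonfinite_grads(model)` before `optimizer.step()` would list the affected parameters. `torch.autograd.set_detect_anomaly(True)` is another option for locating the backward operation that produced the NaN.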
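Expected behavior
The loss and all parameter gradients should remain finite (no NaN), even when the input coordinates are identical.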
Environment
- DGL Version (e.g., 1.0): 0.9.1.post1
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): 1.12.1+cu113
- OS (e.g., Linux): Ubuntu 18.04
- How you installed DGL (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.9.18
- CUDA/cuDNN version (if applicable): CUDA 11.3
- GPU models and configuration (e.g. V100): RTX 4090
- Any other relevant information:
Additional context
Does it still occur in the latest DGL version?
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you.
Hi, I am closing this issue assuming you are happy about our response. Feel free to follow up and reopen the issue if you have more questions with regard to our response.