pytorch_geometric
Parallel inference of PyG models hangs on CPU
🐛 Describe the bug
The code below is meant to run CPU inference of multiple (tiny) PyG GCN models in parallel. However, the code hangs upon the forward call.
import torch
from torch import multiprocessing as mp
import time
import os.path as osp
import torch.nn.functional as F
import torch_geometric.transforms as T
#==========
#Helper function to load the "Planetoid" data set
#==========
from torch_geometric.datasets import Planetoid
def getData():
    path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'Planetoid')
    dataset = Planetoid(path, 'Cora', transform=T.NormalizeFeatures())
    return dataset
#==========
#Define a pytorch geometric model
#==========
from torch_geometric.nn import GCNConv
class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels, cached=False)
        self.conv2 = GCNConv(hidden_channels, out_channels, cached=False)

    def forward(self, x, edge_index, edge_weight=None):
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv1(x, edge_index, edge_weight).relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index, edge_weight)
        return x
#==========
#Inference of a graph neural network
#==========
def evaluate(args):
    (pid, policies, results) = args
    model = policies[pid]
    print("ThreadID:", pid, "Input data dimensions: ", data.x.shape, flush=True)
    pred = model(data.x, data.edge_index, data.edge_weight).max()
    print("Inference completed")
    results[pid] = pred
#==========
#Create a list of 10 random GCNs
#==========
dataset = getData()
data = dataset[0]
policies = []
for _ in range(10):
    model = GCN(dataset.num_features, 100, dataset.num_classes)
    model.to("cpu")
    model.eval()
    policies.append(model)
#==========
#Show that the inference of 1 GCN works well and completes quickly
#==========
PROCESSES_NUM = 5
POLICIES_NUM = len(policies)
results = torch.zeros(len(policies))
# T1 = time.time()
# evaluate((0,policies,results))
# print("Inference time of 1 model: ",time.time()-T1)
#==========
#Attempt to process the 10 models in parallel on the same data
#==========
with mp.Pool(PROCESSES_NUM) as pool:
    pool.map(evaluate, [(i, policies, results) for i in range(POLICIES_NUM)])
#=========
#The problem: The code will get stuck in the forward method of the GCN
#=========
print("Done")
Output:
ThreadID: 0 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 1 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 3 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 4 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 2 Input data dimensions: torch.Size([2708, 1433])
Expected output:
ThreadID: 0 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 1 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 3 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 4 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 2 Input data dimensions: torch.Size([2708, 1433])
Inference completed
Inference completed
Inference completed
Inference completed
Inference completed
ThreadID: 8 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 6 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 7 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 5 Input data dimensions: torch.Size([2708, 1433])
ThreadID: 9 Input data dimensions: torch.Size([2708, 1433])
Inference completed
Inference completed
Inference completed
Inference completed
Inference completed
Done
Environment
- PyG version: 2.0.4
- PyTorch version: 1.12.0 (py3.9_cpu_0)
- OS: Ubuntu 20.04.4 LTS
- Python version: 3.9.7
- CUDA/cuDNN version: n/a
- How you installed PyTorch and PyG (conda, pip, source): PyTorch: conda, PyG: pip (according to instructions here)
- Other relevant information:
- Output of conda list | grep torch:
- cpuonly 2.0 0 pytorch
- ffmpeg 4.3 hf484d3e_0 pytorch
- pytorch 1.12.0 py3.9_cpu_0 pytorch
- pytorch-lightning 1.5.10 pyhd8ed1ab_0 conda-forge
- pytorch-mutex 1.0 cpu pytorch
- torch-cluster 1.6.0 pypi_0 pypi
- torch-geometric 2.0.4 pypi_0 pypi
- torch-scatter 2.0.9 pypi_0 pypi
- torch-sparse 0.6.14 pypi_0 pypi
- torch-spline-conv 1.2.1 pypi_0 pypi
- torch-tb-profiler 0.4.0 pypi_0 pypi
- torchaudio 0.12.0 py39_cpu pytorch
- torchmetrics 0.8.2 pyhd8ed1ab_0 conda-forge
- torchvision 0.13.0 py39_cpu pytorch
For reference, I attach here a non-PyG example that follows the same parallel processing procedure and does work: multi_processing_torch.txt (please change the file suffix from .txt to .py).
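Since the attachment isn't inlined here, a hypothetical sketch of that kind of non-PyG baseline (plain torch.nn.Linear models instead of GCNs, same mp.Pool pattern as above, names and shapes chosen only for illustration, not the attached file) would look roughly like this:
import torch
from torch import multiprocessing as mp

#Evaluate one plain torch model on the shared data (mirrors evaluate() above)
def evaluate_plain(args):
    (pid, models, results) = args
    with torch.no_grad():
        pred = models[pid](data).max()
    results[pid] = pred
    print("Inference completed", flush=True)

#Random data with the same dimensions as Cora's node features
data = torch.randn(2708, 1433)
models = [torch.nn.Linear(1433, 7).eval() for _ in range(10)]
results = torch.zeros(len(models))

with mp.Pool(5) as pool:
    pool.map(evaluate_plain, [(i, models, results) for i in range(len(models))])
print("Done")
Under the default fork start method on Linux this follows the same procedure as the PyG script, only without the GCNConv layers.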
The code runs fine for me on PyG 2.0.4/master, PyTorch 1.12 and Ubuntu 18.04, so I cannot really debug this further. Is there any chance you can find out where the program gets stuck on your end?
cc @mananshah99. Can you try this out as well?
@rusty1s did you use the PyTorch CPU or GPU build? (I use pytorch 1.12.0 py3.9_cpu_0 from the pytorch channel.)
I used PyTorch with CUDA.
I added a few more log traces to the GCN forward implementation. Running this code shows that execution doesn't get past the first convolution.
def forward(self, x, edge_index, edge_weight=None):
    print("dropout1", flush=True)
    x = F.dropout(x, p=0.5, training=self.training)
    print("conv1", flush=True)
    x = self.conv1(x, edge_index, edge_weight)
    print("dropout2", flush=True)
    x = F.dropout(x, p=0.5, training=self.training)
    print("conv2", flush=True)
    x = self.conv2(x, edge_index, edge_weight)
    print("END", flush=True)
    return x
Output:
ThreadID: 0 Input data dimensions: torch.Size([2708, 1433])
dropout1
conv1
ThreadID: 1 Input data dimensions: torch.Size([2708, 1433])
dropout1
ThreadID: 3 Input data dimensions: conv1torch.Size([2708, 1433])
ThreadID: 2 Input data dimensions: torch.Size([2708, 1433])
dropout1
dropout1
conv1
conv1
ThreadID: 4 Input data dimensions: torch.Size([2708, 1433])
dropout1
conv1
Another hint: if I run the code in "Debug" mode (in VS Code), some of the GCN forward calls do execute to the end (but not all).
Any chance you can debug inside the GNN layers?
Yes. The line where the code gets stuck is:
edge_attr = torch.cat([edge_attr[mask], loop_attr], dim=0)
in the add_remaining_self_loops function of torch_geometric/utils/loop.py, with
mask.shape = torch.Size([10556]), loop_attr.shape = torch.Size([2708]), edge_attr.shape = torch.Size([10556]).
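To check whether the hang needs PyG at all, a minimal diagnostic sketch (shapes taken from the log above, everything else assumed) that runs only this torch.cat under the same mp.Pool pattern could be:
import torch
from torch import multiprocessing as mp

#Reproduce only the cat of masked edge attributes and self-loop attributes
def run_cat(pid):
    edge_attr = torch.ones(10556)
    loop_attr = torch.ones(2708)
    mask = torch.ones(10556, dtype=torch.bool)
    out = torch.cat([edge_attr[mask], loop_attr], dim=0)
    print("cat done in worker", pid, out.shape, flush=True)

with mp.Pool(5) as pool:
    pool.map(run_cat, range(10))
print("Done")
If this also hangs, the problem sits below PyG in the PyTorch/CPU multiprocessing stack.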
I installed from source using the command pip install git+https://github.com/pyg-team/pytorch_geometric.git. When debugging with VS Code, the debugger doesn't stop at my breakpoints inside PyG. I'm not a VS Code expert yet; maybe you know what's required to have the debugger step into installed sources.
Sorry for the late reply. I am not a VS Code user, but maybe you can dig deeper with good old print debugging at certain points within MessagePassing?
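For example, one way to do that without editing the installed sources is to wrap MessagePassing.propagate with prints before building the models; a minimal sketch, assuming the PyG 2.0.4 signature propagate(edge_index, size=None, **kwargs):
from torch_geometric.nn.conv import MessagePassing

_orig_propagate = MessagePassing.propagate

#Print before and after every propagate() call so a hang inside message passing shows up in the log
def traced_propagate(self, edge_index, size=None, **kwargs):
    print(self.__class__.__name__, "entering propagate", flush=True)
    out = _orig_propagate(self, edge_index, size=size, **kwargs)
    print(self.__class__.__name__, "leaving propagate", flush=True)
    return out

MessagePassing.propagate = traced_propagate
With the default fork start method the patch is inherited by the worker processes, so it only needs to run once in the parent before the pool is created.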