Parallel inference of PyG models hangs on CPU

Open • mattiasmar opened this issue 3 years ago • 12 comments

🐛 Describe the bug

The code below is meant to run inference on multiple (tiny) PyG GCN models in parallel on CPU. However, it hangs inside the forward call.

import torch
from torch import multiprocessing as mp
import time
import os.path as osp
import torch.nn.functional as F
import torch_geometric.transforms as T

#==========
#Helper function to load the "Planetoid" data set
#==========
from torch_geometric.datasets import Planetoid
def getData():
    path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'Planetoid')
    dataset = Planetoid(path, 'Cora', transform=T.NormalizeFeatures())
    return dataset

#==========
#Define a pytorch geometric model
#==========
from torch_geometric.nn import GCNConv
class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels, cached=False)
        self.conv2 = GCNConv(hidden_channels, out_channels, cached=False)

    def forward(self, x, edge_index, edge_weight=None):
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv1(x, edge_index, edge_weight).relu()
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index, edge_weight)
        return x
 

#==========
#Inference of a graph neural network
#==========
def evaluate(args):
    (pid, policies, results) = args
    model = policies[pid]
    print("ThreadID:", pid, "Input data dimensions: ",data.x.shape,flush=True)
    pred = model(data.x, data.edge_index, data.edge_weight).max()
    print("Inference completed")
    results[pid] = pred
       

#==========
#Create a list of 10 randomly initialized GCNs
#==========
dataset = getData()
data = dataset[0]
policies = []
for _ in range(10):
    model = GCN(dataset.num_features, 100, dataset.num_classes)
    model.to("cpu")
    model.eval()
    policies.append(model)


#==========
#Show that the inference of 1 GCN works well and completes quickly
#==========
PROCESSES_NUM = 5
POLICIES_NUM = len(policies)
results = torch.zeros(len(policies))
# T1 = time.time()
# evaluate((0,policies,results))
# print("Inference time of 1 model: ",time.time()-T1)


#==========
#Attempt to process the 10 models in parallel on the same data
#==========
with mp.Pool(PROCESSES_NUM) as pool:
    pool.map(evaluate, [(i, policies, results) for i in range(POLICIES_NUM)])
    #=========
    #The problem: The code will get stuck in the forward method of the GCN
    #=========
print("Done")


Output:

ThreadID: 0 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 1 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 3 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 4 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 2 Input data dimensions:  torch.Size([2708, 1433])

Expected output:

ThreadID: 0 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 1 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 3 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 4 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 2 Input data dimensions:  torch.Size([2708, 1433])
Inference completed
Inference completed
Inference completed
Inference completed
Inference completed
ThreadID: 8 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 6 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 7 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 5 Input data dimensions:  torch.Size([2708, 1433])
ThreadID: 9 Input data dimensions:  torch.Size([2708, 1433])
Inference completed
Inference completed
Inference completed
Inference completed
Inference completed
Done

Environment

  • PyG version: 2.0.4
  • PyTorch version: 1.12.0 (py3.9_cpu_0)
  • OS: Ubuntu 20.04.4 LTS
  • Python version: 3.9.7
  • CUDA/cuDNN version: n/a
  • How you installed PyTorch and PyG (conda, pip, source): PyTorch: Conda, PyG: Pip (according to instructions here)
  • Other relevant information:
    • Output of conda list | grep torch:
      • cpuonly 2.0 0 pytorch
      • ffmpeg 4.3 hf484d3e_0 pytorch
      • pytorch 1.12.0 py3.9_cpu_0 pytorch
      • pytorch-lightning 1.5.10 pyhd8ed1ab_0 conda-forge
      • pytorch-mutex 1.0 cpu pytorch
      • torch-cluster 1.6.0 pypi_0 pypi
      • torch-geometric 2.0.4 pypi_0 pypi
      • torch-scatter 2.0.9 pypi_0 pypi
      • torch-sparse 0.6.14 pypi_0 pypi
      • torch-spline-conv 1.2.1 pypi_0 pypi
      • torch-tb-profiler 0.4.0 pypi_0 pypi
      • torchaudio 0.12.0 py39_cpu pytorch
      • torchmetrics 0.8.2 pyhd8ed1ab_0 conda-forge
      • torchvision 0.13.0 py39_cpu pytorch

mattiasmar • Jul 26 '22 12:07

For reference, I attach a non-PyG example that follows the same parallel-processing procedure and does work: multi_processing_torch.txt (please change the file suffix from .txt to .py).

mattiasmar • Jul 26 '22 12:07

The code runs fine for me on PyG 2.0.4/master, PyTorch 1.12 and Ubuntu 18.04, so I cannot really debug this further. Is there any chance you can find out where the program gets stuck on your end?

rusty1s • Jul 30 '22 10:07
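
One way to find out where a worker is stuck, sketched here under assumptions: the standard library's faulthandler module can dump the Python stack of every thread if the process is still running after a timeout. The helper names install_hang_dump and cancel_hang_dump are hypothetical and are not part of PyG or PyTorch.

```python
import faulthandler
import sys


def install_hang_dump(timeout_s=30.0, file=None):
    """Arm a watchdog that dumps all thread stacks after timeout_s seconds.

    Hypothetical helper: it only reports where the process is, it does not
    terminate or fix anything (exit=False).
    """
    target = file if file is not None else sys.stderr
    faulthandler.dump_traceback_later(
        timeout_s, repeat=False, file=target, exit=False
    )


def cancel_hang_dump():
    """Disarm the watchdog, e.g. once the work finished in time."""
    faulthandler.cancel_dump_traceback_later()
```

The standard library Pool accepts an initializer, so something like mp.Pool(PROCESSES_NUM, initializer=install_hang_dump, initargs=(30.0,)) should arm the watchdog in each worker, assuming torch.multiprocessing passes these arguments through unchanged. The dump lists Python frames only, so a hang inside native code shows up as the last Python line that was entered.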

cc @mananshah99. Can you try it out as well?

rusty1s • Jul 30 '22 11:07

@rusty1s Did you use the CPU or the GPU build of PyTorch? (I use pytorch 1.12.0 py3.9_cpu_0.)

mattiasmar • Jul 30 '22 19:07

I used PyTorch with CUDA.

rusty1s • Jul 30 '22 19:07

I added a few more log traces to the GCN forward implementation. Running this code shows that execution doesn't get past the first convolution.

    def forward(self, x, edge_index, edge_weight=None):
        print("dropout1",flush=True)
        x = F.dropout(x, p=0.5, training=self.training)
        print("conv1",flush=True)
        x = self.conv1(x, edge_index, edge_weight)
        print("dropout2",flush=True)
        x = F.dropout(x, p=0.5, training=self.training)
        print("conv2",flush=True)
        x = self.conv2(x, edge_index, edge_weight)
        print("END",flush=True)
        return x

Output:

ThreadID: 0 Input data dimensions:  torch.Size([2708, 1433])
dropout1
conv1
ThreadID: 1 Input data dimensions:  torch.Size([2708, 1433])
dropout1
ThreadID: 3 Input data dimensions:  conv1torch.Size([2708, 1433])

ThreadID: 2 Input data dimensions:  torch.Size([2708, 1433])
dropout1
dropout1
conv1
conv1
ThreadID: 4 Input data dimensions:  torch.Size([2708, 1433])
dropout1
conv1

mattiasmar • Jul 30 '22 19:07

Another hint is that if I run the code in "Debug" mode (in VS Code), some of the GCN forward calls do run to completion (but not all).

mattiasmar • Jul 30 '22 19:07
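
Because the hang appears to depend on timing, it may help to bound the wait so that the script fails loudly instead of blocking forever. The following is a minimal sketch using only the standard Pool API; map_with_timeout is a hypothetical helper name, and the timeout value is arbitrary.

```python
import multiprocessing as mp


def map_with_timeout(pool, fn, items, timeout_s):
    """Like pool.map, but give up after timeout_s seconds.

    Hypothetical helper: raises multiprocessing.TimeoutError if the
    results are not all available in time.
    """
    pending = pool.map_async(fn, items)
    # AsyncResult.get blocks for at most timeout_s seconds.
    return pending.get(timeout=timeout_s)
```

In the script above, this would replace the pool.map call, for example map_with_timeout(pool, evaluate, jobs, 60.0) with jobs being the list of argument tuples. Leaving the with block terminates the pool, which also stops workers that are still stuck.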

Any chance you can debug inside the GNN layers?

rusty1s • Jul 30 '22 20:07

Yes. The code gets stuck on this line:

edge_attr = torch.cat([edge_attr[mask], loop_attr], dim=0)

in the add_remaining_self_loops function of torch_geometric/utils/loop.py.

mattiasmar • Jul 30 '22 20:07

The tensor shapes at that point are:

mask.shape = torch.Size([10556])
loop_attr.shape = torch.Size([2708])
edge_attr.shape = torch.Size([10556])

mattiasmar • Jul 30 '22 20:07

I installed PyG from source using the command:

pip install git+https://github.com/pyg-team/pytorch_geometric.git

When debugging with VS Code, the debugger doesn't stop at my PyG breakpoints. I am not yet a VS Code expert; maybe you know what is required to have the debugger step into installed sources?

mattiasmar • Jul 30 '22 20:07
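
Two possible reasons for skipped breakpoints, neither confirmed in this thread: the debugger may be restricted to user code (in VS Code's Python debugger this is, to my knowledge, the justMyCode launch setting), or the breakpoints may sit in a different copy of the file than the one Python imports. The second case can be checked with the standard library. This is a small sketch; locate_module_source is a hypothetical helper name.

```python
import importlib.util


def locate_module_source(module_name):
    """Return the path of the file Python would import, or None if not found.

    Note: for a dotted name, find_spec imports the parent packages.
    """
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # A parent package of a dotted name is missing.
        return None
    return None if spec is None else spec.origin
```

Calling locate_module_source("torch_geometric.utils.loop") in the same environment should print the path of the file to open in the editor or to add print statements to.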

Sorry for the late reply. I am not a VS Code user, but maybe you can dig deeper with good old print debugging at certain points within MessagePassing?

rusty1s • Aug 02 '22 13:08
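
The print debugging suggested above could be made less invasive with a small wrapper, sketched here under assumptions. The decorator name trace is hypothetical; it tags each message with the process ID and flushes immediately, so output from several workers stays attributable even when it interleaves.

```python
import functools
import os
import time


def trace(fn):
    """Print enter/exit lines for fn, tagged with the current process ID."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        tag = f"[pid {os.getpid()}] {fn.__qualname__}"
        print(f"{tag} enter", flush=True)
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Runs on normal return and on exceptions, but not if the call hangs.
            elapsed = time.perf_counter() - start
            print(f"{tag} exit after {elapsed:.3f}s", flush=True)

    return wrapper


# Hypothetical usage, assuming torch_geometric is installed and that
# MessagePassing.propagate is the method of interest:
#   from torch_geometric.nn import MessagePassing
#   MessagePassing.propagate = trace(MessagePassing.propagate)
```

A call that prints "enter" but never "exit" is the one that hangs. If the pool forks its workers, the patch would need to be applied before the pool is created so that the workers inherit it.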