IndexError when initializing torch_geometric.nn.Sequential during multiprocessing
🐛 Describe the bug
Problem description:
Hello,
I recently encountered an IndexError when attempting to initialize torch_geometric.nn.Sequential within a multiprocessing environment. My suspicion is that due to the shared nature of multiprocessing, the ID of the Sequential module might be the same across multiple processes, leading to conflicts and incorrect indexing.
I would greatly appreciate any suggestions on how to address this issue. :)
Code to reproduce:
def construct_model(args):
    import torch_geometric.nn as gnn
    gcn_type = 'chebconv'
    input_args = 'x, edge_index, edge_weight'
    channel_list = [[32, 16, 1], [32, 1]]
    for t in range(100):
        gcn_layers = []
        for channel in channel_list:
            for i in range(len(channel) - 1):
                _gcn = gnn.ChebConv(in_channels=channel[i], out_channels=channel[i + 1], K=3)
                gcn_layers.append(
                    (_gcn,
                     'x, edge_index, edge_weight -> x')
                )
            built_layers = gnn.Sequential(input_args=input_args, modules=gcn_layers)
            print(channel, gcn_layers, built_layers)

if __name__ == '__main__':
    import torch.multiprocessing as mp
    from tqdm import tqdm
    num_partitions = 100
    num_processes = 2
    ctx = mp.get_context("spawn")
    with ctx.Pool(num_processes) as pool:
        with tqdm(total=num_partitions) as pbar:
            for i, res in enumerate(pool.imap_unordered(construct_model, [1 for i in range(num_partitions)])):  # set up the pool
                pbar.update()
                # pool.apply_async(run_model, args=(args,))
            pbar.close()
        pool.close()  # close the pool
        pool.join()  # join the pool
Error message:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/miniforge3/envs/deeplearning/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/test.py", line 19, in construct_model
    print(channel, gcn_layers, built_layers)
  File "/tmp/torch_geometric.nn.sequential_8f1555_pbw5o1ex.py", line 645, in __repr__
    module_reprs = [
  File "/tmp/torch_geometric.nn.sequential_8f1555_pbw5o1ex.py", line 646, in <listcomp>
    f' ({i}) - {self[i]}: {self._module_descs[i]}'
  File "/tmp/torch_geometric.nn.sequential_8f1555_pbw5o1ex.py", line 639, in __getitem__
    return getattr(self, self._module_names[idx])
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/test.py", line 31, in <module>
    for i, res in enumerate(pool.imap_unordered(construct_model, [1 for i in range(num_partitions)])): # set up the pool
  File "/opt/miniforge3/envs/deeplearning/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
IndexError: list index out of range
Versions
PyTorch version: 2.2.1
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Anaconda gcc) 11.2.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35
Python version: 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.19.0-1010-nvidia-lowlatency-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A40
Nvidia driver version: 550.54.15
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.2.1
[pip3] torch_geometric==2.5.2
[pip3] triton==2.2.0
[conda] blas 1.0 mkl
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py39h5eee18b_1
[conda] mkl_fft 1.3.8 py39h5eee18b_0
[conda] mkl_random 1.2.4 py39hdb19cb5_0
[conda] numpy 1.26.4 py39h5f9d8c6_0
[conda] numpy-base 1.26.4 py39hb5e798b_0
[conda] pyg 2.5.2 py39_torch_2.2.0_cu121 pyg
[conda] pytorch 2.2.1 py3.9_cuda12.1_cudnn8.9.2_0 pytorch
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torchtriton 2.2.0 py39 pytorch
Potential cause
I suspect the issue is caused by loading the wrong Sequential module, likely because a random uid is used to generate the /tmp/torch_geometric.nn.sequential_{uid}.py file.
Explanation
In your construct_model function, you create Sequential modules with both 2 layers and 3 layers. If the current Sequential module is
Sequential(
  (0) - ChebConv(32, 16, K=3, normalization=sym): x, edge_index, edge_weight -> x
  (1) - ChebConv(16, 1, K=3, normalization=sym): x, edge_index, edge_weight -> x
)
and the /tmp/torch_geometric.nn.sequential_{uid}.py file loaded by module_from_template contains the Sequential module
Sequential(
  (0) - ChebConv(32, 16, K=3, normalization=sym): x, edge_index, edge_weight -> x
  (1) - ChebConv(16, 1, K=3, normalization=sym): x, edge_index, edge_weight -> x
  (2) - ChebConv(32, 1, K=3, normalization=sym): x, edge_index, edge_weight -> x
)
then you might get an IndexError when executing the print statement, since the loaded Sequential module has a len(self) of 3 but _module_names contains only 2 values.
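To make the mismatch concrete, here is a minimal pure-Python sketch. ToySequential is a hypothetical stand-in, not PyG's actual class: it only mimics a __repr__ that iterates over a length taken from stale generated code while indexing a shorter _module_names list.

```python
class ToySequential:
    """Hypothetical stand-in for the generated class, not PyG's real code."""

    def __init__(self, module_names, declared_len):
        # Names registered on this instance (what was actually built).
        self._module_names = module_names
        # Length baked into the (possibly stale) generated file.
        self._declared_len = declared_len

    def __len__(self):
        return self._declared_len

    def __getitem__(self, idx):
        # Raises IndexError when idx exceeds the registered names.
        return self._module_names[idx]

    def __repr__(self):
        rows = [f"  ({i}) - {self[i]}" for i in range(len(self))]
        return "ToySequential(\n" + "\n".join(rows) + "\n)"


# Matching file and instance: repr works.
ok = ToySequential(["module_0", "module_1"], declared_len=2)
print(ok)

# Stale 3-layer file loaded for a 2-layer instance: repr fails.
stale = ToySequential(["module_0", "module_1"], declared_len=3)
try:
    print(stale)
    error = None
except IndexError as exc:
    error = exc
    print(f"IndexError: {exc}")
```

The failure only shows up when the object is printed, which matches the traceback pointing at __repr__ rather than at the constructor.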
Thanks for your hint.
In my experience, torch_geometric.nn.Sequential can get confused and raise this error whenever many objects are created, whether or not multiprocessing is involved, because I set the same random seed at the beginning of my function.
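A small sketch of why a fixed seed could lead to colliding file names. This assumes the uid is drawn from Python's global random module; I have not verified that against the PyG source, and fake_uid below is a hypothetical helper, not a PyG function.

```python
import random


def fake_uid() -> str:
    # Hypothetical: a 6-hex-digit id drawn from the global RNG.
    return f"{random.randrange(16**6):06x}"


def uids_after_seeding(seed: int, n: int) -> list:
    # Re-seeding resets the global RNG, so the same ids come out again.
    random.seed(seed)
    return [fake_uid() for _ in range(n)]


first = uids_after_seeding(0, 3)
second = uids_after_seeding(0, 3)

# Identical sequences mean identical generated file names, regardless of
# whether the calls happen in one process or in several.
print(first == second)
```

Under that assumption, every call that starts by re-seeding would reuse the same uids, so a later Sequential could pick up a file generated for an earlier one with a different number of layers.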
Right now, my temporary workaround is to use torch.nn.ModuleDict instead of torch_geometric.nn.Sequential, with some adjustments to my code.
Your original code should work fine if you install PyG from source (pip install git+https://github.com/pyg-team/pytorch_geometric.git). The recent PyG version 2.6.0 contains a refactored Sequential module:
- The logic for creating the Sequential class on the fly has been moved from the sequential.jinja file to the sequential.py file.
- Only the forward function is created by the sequential.jinja file.
Take a look at #9369.