pytorch_geometric
Heterogeneous graph: NeighborLoader with num_workers>0 hangs after many epochs
🐛 Describe the bug
My code is like this:
***The code for creating graph, GNN model***
train_loader = NeighborLoader(
    train_data,
    num_neighbors=[2] * 2,
    batch_size=train_batch_size,
    input_nodes='case',
    shuffle=True,
    num_workers=4,
)
test_loader = NeighborLoader(
    test_data,
    num_neighbors=[2] * 2,
    batch_size=train_batch_size,
    input_nodes='case',
    shuffle=True,
    num_workers=4,
)
***The code to train and test***
(I need the subgraphs sampled by test_loader to be random, so I set shuffle=True
and use the n_id attribute to rearrange the predicted logits.)
I used W&B to log the training loss and other metrics during training. In two experiments the run got stuck, once after 80 min and once after 6 h: the curves stopped updating for about 2 hours. My only guess is that num_workers is the cause, because after I removed the num_workers parameter the full 22 h run finished successfully.
Honestly, it's hard for me to trace the bug or reproduce it, so I can only report the problem.
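Since the hang is hard to reproduce, one option is to make the process report where it is stuck. The sketch below uses Python's standard `faulthandler` module as a heartbeat watchdog: if no batch arrives within `timeout_s` seconds, the tracebacks of all threads in the main process are dumped. The helper names (`arm_hang_watchdog`, `disarm_hang_watchdog`, `run_epoch`) and the 600 s default are hypothetical and not part of PyG or PyTorch; `loader` can be any iterable, such as `train_loader` above.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s, file=None):
    """Dump all thread tracebacks unless re-armed within `timeout_s` seconds."""
    # Arming again cancels the previous timer, so calling this once per
    # batch works as a heartbeat. `file` must have a real file descriptor.
    target = sys.stderr if file is None else file
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=target)


def disarm_hang_watchdog():
    """Cancel a pending traceback dump, if any."""
    faulthandler.cancel_dump_traceback_later()


def run_epoch(loader, step_fn, timeout_s=600.0, file=None):
    """Run `step_fn` on every batch while the watchdog guards each fetch."""
    num_batches = 0
    arm_hang_watchdog(timeout_s, file)
    try:
        for batch in loader:
            step_fn(batch)
            num_batches += 1
            arm_hang_watchdog(timeout_s, file)  # heartbeat: reset the timer
    finally:
        disarm_hang_watchdog()
    return num_batches
```

If the dump shows the main process waiting inside the data loader's iterator, that would point towards the workers rather than the training step. Note that this only covers the main process; the worker processes are separate and would need to be inspected on their own.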
Environment
- PyG version: 2.1.0.dev20220815
- PyTorch version: 1.11.0
- OS: Linux
- Python version: 3.8.13
- CUDA/cuDNN version: cuda10.2 cudnn7.6.5
- How you installed PyTorch and PyG (conda, pip, source):
PyTorch: conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=10.2 -c pytorch
PyG:
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install torch-sparse -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install pyg-nightly
- Any other relevant information (e.g., version of torch-scatter): torch-scatter 2.0.9, torch-sparse 0.6.14
Thanks for reporting. Do you have some intuition about what might cause this? Is there a memory leak, i.e., are memory requirements increasing over epochs? Any guidance appreciated!
My guess is that variables accumulating across the many workers may eventually lead to running out of memory.
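One way to check the memory hypothesis would be to record the process's peak memory once per epoch and see whether it keeps climbing. This is a minimal sketch using the standard `resource` module; the helper names (`peak_rss_mib`, `log_memory`) are hypothetical, and it assumes Linux, where `ru_maxrss` is reported in KiB. It only measures the main process, so the loader's worker processes would have to be watched separately (for example with `top`).

```python
import resource


def peak_rss_mib():
    """Peak resident set size of the current process in MiB (Linux: KiB input)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


def log_memory(epoch, history):
    """Append (epoch, peak RSS in MiB) to `history` and return that entry."""
    entry = (epoch, peak_rss_mib())
    history.append(entry)
    return entry
```

Calling `log_memory(epoch, history)` at the end of each epoch (and optionally sending the value to W&B alongside the loss) gives a curve to inspect. Because this is a peak value it never decreases; a peak that is still rising after many epochs would be consistent with a leak, while a flat curve would suggest looking elsewhere.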