
Heterogeneous graph: NeighborLoader with num_workers>0 hangs after many epochs

Open PolarisRisingWar opened this issue 1 year ago • 1 comments

🐛 Describe the bug

My code looks like this:

***The code for creating graph, GNN model***

from torch_geometric.loader import NeighborLoader

train_loader = NeighborLoader(
    train_data,
    num_neighbors=[2] * 2,
    batch_size=train_batch_size,
    input_nodes='case',
    shuffle=True,
    num_workers=4,
)

test_loader = NeighborLoader(
    test_data,
    num_neighbors=[2] * 2,
    batch_size=train_batch_size,
    input_nodes='case',
    shuffle=True,  # intentional: test subgraphs should be sampled randomly
    num_workers=4,
)

***The code to train and test***

(I need the subgraphs sampled by test_loader to be random, so I set shuffle=True and use the n_id attribute to rearrange the predicted logits.) I used W&B to log the training loss and other metrics during training. In two experiments, training hung after about 80 minutes and about 6 hours respectively: the curves stopped updating for about 2 hours. My only guess is that num_workers is the cause, because after I removed the num_workers parameter the full 22-hour run finished successfully. Honestly, it is hard for me to trace the bug or reproduce it, so I can only report the problem.
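A hang like this is hard to trace after the fact, so one option is to arm a watchdog before each batch that dumps the stack traces of all threads if a single step takes too long. The sketch below uses only the standard-library `faulthandler` module. The helper name `run_with_watchdog`, the `step_fn` callback, and the 600-second threshold are hypothetical choices for illustration, not part of PyG or of the code above.

```python
import faulthandler
import sys

HANG_TIMEOUT_S = 600  # hypothetical threshold; pick something well above a normal step


def run_with_watchdog(batches, step_fn, timeout_s=HANG_TIMEOUT_S, file=sys.stderr):
    """Run step_fn on every batch and return the number of completed steps.

    If fetching plus processing one batch takes longer than timeout_s,
    the tracebacks of all threads in this process are written to `file`
    (which must provide fileno()). The process is not killed.
    """
    steps = 0
    iterator = iter(batches)
    try:
        while True:
            # Re-arming replaces the previous timer, so the timeout
            # applies to each step separately (batch fetch included).
            faulthandler.dump_traceback_later(timeout_s, exit=False, file=file)
            try:
                batch = next(iterator)
            except StopIteration:
                break
            step_fn(batch)
            steps += 1
    finally:
        # Always disarm the timer, even if step_fn raised.
        faulthandler.cancel_dump_traceback_later()
    return steps
```

In a training script this could be called once per epoch, for example `run_with_watchdog(train_loader, train_step)` with a user-defined `train_step`. The dump only covers the main process. If it shows the main thread waiting on the loader's queue, that would suggest the stall is on the worker side, although it does not prove it.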

Environment

  • PyG version: 2.1.0.dev20220815
  • PyTorch version: 1.11.0
  • OS: Linux
  • Python version: 3.8.13
  • CUDA/cuDNN version: cuda10.2 cudnn7.6.5
  • How you installed PyTorch and PyG (conda, pip, source):
PyTorch:
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=10.2 -c pytorch
PyG:
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install torch-sparse -f https://data.pyg.org/whl/torch-1.11.0+cu102.html
pip install pyg-nightly
  • Any other relevant information (e.g., version of torch-scatter): torch-scatter 2.0.9 torch-sparse 0.6.14

PolarisRisingWar, Sep 03 '22 12:09

Thanks for reporting. Do you have any intuition about what might cause this? Is there a memory leak, with memory requirements increasing over epochs? Any guidance is appreciated!
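To check whether memory grows over epochs, the resident set size of the main process and of the loader workers could be logged periodically. The sketch below is Linux-only (it reads `/proc`) and uses only the standard library. It assumes that the DataLoader workers show up in `multiprocessing.active_children()` while an iterator over the loader is alive. The helper names `rss_mib` and `memory_snapshot` are hypothetical.

```python
import multiprocessing


def rss_mib(pid="self"):
    """Return the resident set size of a process in MiB, or None if unavailable.

    Reads the VmRSS field of /proc/<pid>/status, so this works on Linux only.
    """
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024  # field is reported in kB
    except OSError:
        pass  # no /proc on this platform, or the process has already exited
    return None


def memory_snapshot():
    """Collect the RSS of this process and of its live multiprocessing children."""
    workers = {}
    for proc in multiprocessing.active_children():
        workers[proc.pid] = rss_mib(proc.pid)
    return {"main_mib": rss_mib(), "workers_mib": workers}
```

Calling `memory_snapshot()` every few hundred batches inside the training loop and sending the values to the same logger as the loss (W&B in this report) would show whether any process grows steadily before the hang. Without persistent workers, the worker processes only exist while the loader is being iterated, so the call needs to happen inside the batch loop rather than between epochs.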

rusty1s, Sep 05 '22 06:09

My guess: with many workers, variables accumulating in each worker process might eventually lead to running out of memory?

LukeLIN-web, Oct 03 '22 21:10