
[GraphBolt] Cannot re-initialize CUDA in forked subprocess

Open · Rhett-Ying opened this issue 1 year ago · 10 comments

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. python3 examples/sampling/graphbolt/node_classification.py --num-workers 4
 File "/opt/conda/envs/dgl-dev-gpu-dgl-0/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 183, in wrap_generator
    response = gen.send(None)
  File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/base.py", line 209, in __iter__
    data = recursive_apply(data, apply_to, self.device)
  File "/home/ubuntu/workspace/dgl_0/python/dgl/utils/internal.py", line 1135, in recursive_apply
    return fn(data, *args, **kwargs)
  File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/base.py", line 145, in apply_to
    return x.to(device) if hasattr(x, "to") else x
  File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/minibatch.py", line 496, in to
    setattr(self, attr, apply_to(getattr(self, attr), device))
  File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/minibatch.py", line 462, in apply_to
    return recursive_apply(x, lambda x: _to(x, device))
  File "/home/ubuntu/workspace/dgl_0/python/dgl/utils/internal.py", line 1135, in recursive_apply
    return fn(data, *args, **kwargs)
  File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/minibatch.py", line 462, in <lambda>
    return recursive_apply(x, lambda x: _to(x, device))
  File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/minibatch.py", line 459, in _to
    return x.to(device) if hasattr(x, "to") else x
  File "/opt/conda/envs/dgl-dev-gpu-dgl-0/lib/python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
This exception is thrown by __iter__ of CopyTo(datapipe=ShardingFilterIterDataPipe, device=device(type='cuda'), extra_attrs=['seed_nodes'])
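For context, the PyTorch restriction behind this error can be reproduced outside DGL. Below is a minimal standalone sketch (an illustration added for context, not part of the original report): once the parent process has initialized CUDA, a worker created with the default 'fork' start method cannot touch CUDA tensors, while the 'spawn' start method works. Since the traceback shows the workers were forked, any CUDA work (here, the CopyTo stage) must happen outside them.

```python
# Minimal reproduction of the underlying PyTorch restriction (illustrative,
# not DGL-specific). Requires a CUDA-capable machine.
import torch
import torch.multiprocessing as mp

def touch_cuda(_):
    # Under the 'fork' start method this raises
    # "RuntimeError: Cannot re-initialize CUDA in forked subprocess."
    return torch.zeros(1, device="cuda").item()

if __name__ == "__main__":
    torch.cuda.init()  # parent initializes CUDA first
    ctx = mp.get_context("spawn")  # change to "fork" to reproduce the error
    with ctx.Pool(1) as pool:
        print(pool.map(touch_cuda, [0]))
```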

Expected behavior

Environment

  • DGL Version (e.g., 1.0): master
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
  • OS (e.g., Linux):
  • How you installed DGL (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version (if applicable):
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

Rhett-Ying · Jan 19 '24 00:01

I think this is due to not passing --mode=cpu-cuda.

mfbalin · Jan 19 '24 00:01

Do you think we should automatically set args.mode = "cpu-cuda" if the user passes --num-workers > 0?

mfbalin · Jan 19 '24 01:01
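A minimal sketch of what that auto-override could look like in the example script's argument handling. The flag names follow the command above; the set of mode choices and the default are assumptions, not the script's actual values:

```python
# Hedged sketch: auto-select a worker-safe mode. The `choices` list and the
# default are assumptions about the example script, not its actual options.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--mode",
    default="pinned-cuda",
    choices=["cpu-cpu", "cpu-cuda", "pinned-cuda", "cuda-cuda"],
)
parser.add_argument("--num-workers", type=int, default=0)
args = parser.parse_args()

if args.num_workers > 0 and args.mode != "cpu-cuda":
    # Forked dataloader workers must not initialize CUDA, so sample on CPU
    # in the workers and copy to the GPU in the main process.
    print(f"--num-workers={args.num_workers} > 0: overriding --mode to cpu-cuda")
    args.mode = "cpu-cuda"
```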

cpu-cuda works well

Rhett-Ying · Jan 19 '24 01:01

Is pinned-cuda mutually exclusive with num_workers > 0?

Rhett-Ying · Jan 19 '24 01:01

Yes, for GPU sampling, num_workers has to be 0, I believe. Is there any use case where a value greater than 0 would be useful when sampling on the GPU?

mfbalin · Jan 19 '24 01:01

Then please throw an exception when such a contradiction occurs.

Rhett-Ying · Jan 19 '24 01:01
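One way such a check could look, as a hedged sketch; the function and mode names are illustrative, not the actual GraphBolt API:

```python
# Illustrative validation only; names are hypothetical, not GraphBolt's API.
def validate_sampling_args(mode: str, num_workers: int) -> None:
    """Reject argument combinations that would fork CUDA-touching workers."""
    gpu_sampling_modes = ("pinned-cuda", "cuda-cuda")  # assumed mode names
    if mode in gpu_sampling_modes and num_workers > 0:
        raise ValueError(
            f"mode={mode!r} samples on the GPU and requires num_workers=0, "
            f"got num_workers={num_workers}; use mode='cpu-cuda' to combine "
            "worker processes with GPU training."
        )
```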

Hmm, I think we need to work on the dataloader argument checking overall. We could check where the CopyTo is, whether the feature store is pinned or on the device, whether the graph is pinned or on the device, etc.

mfbalin · Jan 19 '24 01:01

However, how do you think we can check the graph in a general manner? The user might pass sample_neighbors, sample_layer_neighbors, or any other custom datapipe. The features could be custom too.

mfbalin · Jan 19 '24 01:01
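A rough sketch of how such a general check might locate the relevant stages: walk the datapipe graph and collect instances of a given class (e.g. CopyTo), then inspect their devices. traverse_dps is PyTorch's datapipe-graph traversal helper; everything else here is an assumption about how the check would be wired up:

```python
# Sketch of a generic datapipe-graph inspection; only traverse_dps is a real
# PyTorch helper, the rest is hypothetical glue.
from torch.utils.data.graph import traverse_dps

def find_datapipes(datapipe, cls):
    """Collect all datapipes of type `cls` reachable from `datapipe`."""
    found = []

    def visit(graph):
        # traverse_dps returns {id: (datapipe, sub_graph)} recursively.
        for dp, sub_graph in graph.values():
            if isinstance(dp, cls):
                found.append(dp)
            visit(sub_graph)

    visit(traverse_dps(datapipe))
    return found

# Hypothetical usage: refuse CUDA CopyTo stages when workers will be forked.
# copy_tos = find_datapipes(dataloader_pipe, CopyTo)
# if num_workers > 0 and any(dp.device.type == "cuda" for dp in copy_tos):
#     raise ValueError("CopyTo(device='cuda') cannot run in forked workers.")
```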

@frozenbugs any insights here?

mfbalin · Jan 23 '24 05:01

> However, how do you think we can check the graph in a general manner? The user might pass sample_neighbors, sample_layer_neighbors, or any other custom datapipe. The features could be custom too.

I am not sure whether I understand this comment, but for the discussion before it: I think the error message reported by Rui is not too bad. If we want to clarify it further, wrapping the copy_to call in a Python try/except and improving the error message should be enough.

frozenbugs · Jan 25 '24 06:01
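As a sketch of that suggestion, the copy stage could catch the fork-related RuntimeError and re-raise it with actionable guidance. This is a simplified stand-in for dgl.graphbolt.CopyTo, not the real class:

```python
# Simplified stand-in for GraphBolt's CopyTo, showing the clarified error.
class CopyTo:
    def __init__(self, datapipe, device):
        self.datapipe = datapipe
        self.device = device

    def __iter__(self):
        for data in self.datapipe:
            try:
                yield data.to(self.device)
            except RuntimeError as err:
                if "forked subprocess" in str(err):
                    raise RuntimeError(
                        "CopyTo to a CUDA device ran inside a forked "
                        "dataloader worker; set num_workers=0 or use a CPU "
                        "sampling mode (e.g. --mode=cpu-cuda) so the copy "
                        "happens in the main process."
                    ) from err
                raise
```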