[GraphBolt] Cannot re-initialize CUDA in forked subprocess
🐛 Bug
To Reproduce
Steps to reproduce the behavior:
- python3 examples/sampling/graphbolt/node_classification.py --num-workers 4
File "/opt/conda/envs/dgl-dev-gpu-dgl-0/lib/python3.10/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 183, in wrap_generator
response = gen.send(None)
File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/base.py", line 209, in __iter__
data = recursive_apply(data, apply_to, self.device)
File "/home/ubuntu/workspace/dgl_0/python/dgl/utils/internal.py", line 1135, in recursive_apply
return fn(data, *args, **kwargs)
File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/base.py", line 145, in apply_to
return x.to(device) if hasattr(x, "to") else x
File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/minibatch.py", line 496, in to
setattr(self, attr, apply_to(getattr(self, attr), device))
File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/minibatch.py", line 462, in apply_to
return recursive_apply(x, lambda x: _to(x, device))
File "/home/ubuntu/workspace/dgl_0/python/dgl/utils/internal.py", line 1135, in recursive_apply
return fn(data, *args, **kwargs)
File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/minibatch.py", line 462, in <lambda>
return recursive_apply(x, lambda x: _to(x, device))
File "/home/ubuntu/workspace/dgl_0/python/dgl/graphbolt/minibatch.py", line 459, in _to
return x.to(device) if hasattr(x, "to") else x
File "/opt/conda/envs/dgl-dev-gpu-dgl-0/lib/python3.10/site-packages/torch/cuda/__init__.py", line 284, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
This exception is thrown by __iter__ of CopyTo(datapipe=ShardingFilterIterDataPipe, device=device(type='cuda'), extra_attrs=['seed_nodes'])
Expected behavior
Environment
- DGL Version (e.g., 1.0): master
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3):
- OS (e.g., Linux):
- How you installed DGL (conda, pip, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version (if applicable):
- GPU models and configuration (e.g. V100):
- Any other relevant information:
Additional context
I think this is due to not passing --mode=cpu-cuda. Do you think we should automatically set args.mode = cpu-cuda if the user passes num-workers > 0?
cpu-cuda works well.
pinned-cuda is mutually exclusive with num_workers > 0?
Yes, for GPU sampling, num_workers has to be 0, I believe. Is there any use case where passing a value greater than 0 would be useful when sampling on the GPU?
Then please throw an exception if such a contradiction occurs.
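A hedged sketch of that check (the flag names come from the discussion above; which modes count as GPU-side and where the check would live in the script are assumptions):

```python
def validate_mode(mode: str, num_workers: int) -> None:
    """Hypothetical check: GPU-side copying/sampling cannot be combined with
    forked dataloader workers, since CUDA cannot be re-initialized in them."""
    if num_workers > 0 and mode in ("pinned-cuda", "cuda-cuda"):
        raise ValueError(
            f"mode='{mode}' requires num_workers=0, but got "
            f"num_workers={num_workers}. Use --mode=cpu-cuda with workers, "
            "or set --num-workers=0 to sample on the GPU."
        )
```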
Hmm, I think we need to work on the dataloader argument checking overall. We could check where the CopyTo is, whether the feature store is pinned or on the device, whether the graph is pinned or on the device, etc.
However, how do you think we can check the graph in a general manner? The user might pass sample_neighbors, sample_layer_neighbors or any other custom datapipes. The features could be custom too.
@frozenbugs any insights here?
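Purely as an illustration of the kind of check being discussed, a sketch of a generic device check over whatever storages the user passes (the idea that each storage exposes a .device attribute is an assumption, not an actual GraphBolt API):

```python
import torch


def check_worker_compatibility(storages, num_workers: int) -> None:
    """Hypothetical helper: reject CUDA-resident storages when forked worker
    processes are requested, since forked workers cannot touch CUDA."""
    if num_workers == 0:
        return
    for storage in storages:
        device = getattr(storage, "device", None)
        if isinstance(device, torch.device) and device.type == "cuda":
            raise ValueError(
                f"{type(storage).__name__} is on {device}, which is "
                f"incompatible with num_workers={num_workers}; move it to "
                "CPU (optionally pinned) or set num_workers=0."
            )
```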
> However, how do you think we can check the graph in a general manner? The user might pass sample_neighbors, sample_layer_neighbors or any other custom datapipes. The features could be custom too.
I am not sure whether I understand this comment, but for the discussion before this one: I think the error message reported by Rui is not too bad. If we want to clarify it further, wrapping the call to copy_to in a Python try/except and improving the error message should be enough.
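A minimal sketch of that wrapping, assuming it would sit around the device transfer performed by CopyTo (the function and variable names here are simplified stand-ins, not the actual GraphBolt code):

```python
def copy_minibatch_to(minibatch, device):
    """Simplified stand-in for the transfer done inside CopyTo.__iter__."""
    try:
        return minibatch.to(device)
    except RuntimeError as err:
        if "Cannot re-initialize CUDA in forked subprocess" in str(err):
            raise RuntimeError(
                "CopyTo tried to move a minibatch to a CUDA device inside a "
                "forked dataloader worker. Set num_workers=0 for GPU-side "
                "copying, or keep CopyTo on the main process (e.g. "
                "--mode=cpu-cuda)."
            ) from err
        raise
```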