AttributeError: 'list' object has no attribute 'local_scope'
🐛 Bug
When I run dgl\examples\pytorch\graphsage\dist\train_dist.py on GPUs following the instructions in README.md, it works fine, but after changing the model's network layers the following error occurs:
AttributeError: 'list' object has no attribute 'local_scope'
To Reproduce
Steps to reproduce the behavior:
- The model trains fine when running the following command. The code in the workspace is copied from dgl\examples\pytorch\graphsage\dist\ .
/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 1 --backend nccl"
- Change the network layers in dgl\examples\pytorch\graphsage\dist\train_dist.py as follows
# GAT
class DistGAT(nn.Module):
    def __init__(
        self, in_feats, n_hidden, n_classes, heads
        # n_layers, activation, dropout
    ):
        super().__init__()
        self.gat_layers = nn.ModuleList()
        # two-layer GAT
        self.gat_layers.append(
            dglnn.GATConv(
                in_feats,
                n_hidden,
                heads[0],
                feat_drop=0.6,
                attn_drop=0.6,
                activation=F.elu,
            )
        )
        self.gat_layers.append(
            dglnn.GATConv(
                n_hidden * heads[0],  # the first layer outputs n_hidden * heads[0] features
                n_classes,
                heads[1],
                feat_drop=0.6,
                attn_drop=0.6,
                activation=None,
            )
        )

    def forward(self, g, inputs):
        h = inputs
        for i, layer in enumerate(self.gat_layers):
            h = layer(g, h)
            if i == 1:  # last layer: average the attention heads
                h = h.mean(1)
            else:  # other layer(s): concatenate the attention heads
                h = h.flatten(1)
        return h
def run(args, device, data):
    ...
    # Define model and optimizer
    model = DistGAT(
        in_feats,
        args.num_hidden,
        n_classes,
        heads=[8, 1],
    )
    # args.num_layers,
    # F.relu,
    # args.dropout,
    # )
    ...
Then execute:
/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 gat-2-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"
The cluster starts as expected, and then the following error occurs:
Traceback (most recent call last):
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
main(args)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
run(args, device, data)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
batch_pred = model(blocks, batch_inputs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
h = layer(g, h)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...
(a second client prints an identical traceback before exiting)
- GCN is arguably even closer to GraphSAGE; making the analogous change produces the same error.
# GCN
class DistGCN(nn.Module):
    def __init__(self, in_size, hid_size, out_size):
        super().__init__()
        self.layers = nn.ModuleList()
        # two-layer GCN
        self.layers.append(
            dglnn.GraphConv(in_size, hid_size, activation=F.relu)
        )
        self.layers.append(dglnn.GraphConv(hid_size, out_size))
        self.dropout = nn.Dropout(0.5)

    def forward(self, g, features):
        h = features
        for i, layer in enumerate(self.layers):
            if i != 0:
                h = self.dropout(h)
            h = layer(g, h)
        return h
def run(args, device, data):
    ...
    # Define model and optimizer
    # model = GCN(
    #     in_feats,
    #     args.num_hidden,
    #     n_classes,
    #     args.num_layers,
    #     F.relu,
    #     args.dropout,
    # )
    model = DistGCN(in_feats, 16, n_classes).to(device)
    ...
Then execute:
/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 gcn-dist-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"
The resulting output is:
Traceback (most recent call last):
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
main(args)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
run(args, device, data)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
batch_pred = model(blocks, batch_inputs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
h = layer(g, h)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...
(a second client prints an identical traceback before exiting)
Client[0] in group[0] is exiting...
Expected behavior
Distributed training should work for other models as well, e.g. GAT, GCN, GIN, etc.
Environment
- DGL Version (e.g., 1.0): DGL 2.1.0
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.1.0
- OS (e.g., Linux): Ubuntu 20.04
- How you installed DGL (conda, pip, source): conda
- Build command you used (if compiling from source):
- Python version: Python 3.9.18
- CUDA/cuDNN version (if applicable): cuda_12.1.0_530.30.02_linux
- GPU models and configuration (e.g. V100): one machine has a GeForce RTX 2060 SUPER, the other a GeForce GTX 1660 SUPER.
- Any other relevant information: I train the above models on a local cluster of two computers (each with a different graphics card) and have not migrated to the cloud yet.
Additional context
After reviewing the documentation on docs.dgl.ai, I am still unclear on how to resolve the following error:
AttributeError: 'list' object has no attribute 'local_scope'
The code in the dgl/examples/pytorch/graphsage/dist directory is quite enlightening, and I am interested in extending it to additional models. Any guidance you could offer would be greatly appreciated.
The command that runs the training has a few more parameters and paths than the command in README.md because the following problems occur:
- Probably because I installed DGL in a conda virtual environment: if I don't use the full path to python3, I get
ModuleNotFoundError: No module named 'numpy'
or
ModuleNotFoundError: No module named 'dgl'
- If I use the default --backend parameter, gloo, it fails with
[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with [/opt/conda/conda-bld/pytorch_1695392035629/work/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
and I have no idea how to solve this (see the note after this list).
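For what it's worth, Gloo's connectFullMesh often fails when it auto-selects the wrong network interface on machines with several NICs; pinning the interface explicitly on every machine before launching may be worth a try (purely a guess, and the interface name below is illustrative):
export GLOO_SOCKET_IFNAME=eno1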
Once again, thank you for your exceptional work!
@Rhett-Ying do we have DistDGL examples?
Please refer to the non-dist versions of the GAT/GCN models, such as https://github.com/dmlc/dgl/tree/master/examples/pytorch/gat, to make sure the model is runnable. The model code should be the same in DistDGL and non-dist training.
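For reference, the traceback shows why the copied model code breaks: train_dist.py calls model(blocks, batch_inputs), where blocks is a list of DGLBlock objects (one sampled block per layer), while the forward above passes that whole list to every conv layer, hence AttributeError: 'list' object has no attribute 'local_scope'. Below is a minimal, untested sketch of a block-based forward for the two-layer GAT above, mirroring how the example's DistSAGE forward zips layers with blocks:

def forward(self, blocks, inputs):
    h = inputs
    for i, (layer, block) in enumerate(zip(self.gat_layers, blocks)):
        # Each GATConv consumes one sampled block, not the whole list.
        h = layer(block, h)
        if i == len(self.gat_layers) - 1:  # last layer: average the heads
            h = h.mean(1)
        else:  # hidden layer(s): concatenate the heads
            h = h.flatten(1)
    return h

The same change applies to the DistGCN forward (one GraphConv call per block).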
A better option for running various models with distributed training/inference is GraphStorm, which offers high-level APIs.
Thanks for your advice. Since the "Gloo connectFullMesh failed with..." error has not been resolved, I am trying to train some models from https://github.com/dmlc/dgl/tree/master/examples/pytorch/ on two machines.
Also, I would like to ask about dataset partitioning. When partitioning the dataset with https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/dist/partition_graph.py, the memory required is several times the size of the dataset. Are there any corresponding memory optimisations, or are other tools provided?
Are there any corresponding optimisations for memory, or are other tools provided?
Unfortunately, there's not much optimization available for the partition stage. dgl.distributed.partition_graph() is the most convenient API available for now. But we also support partitioning a graph with a distributed pipeline if you have multiple machines with small CPU RAM; please refer to here for more details. This partition pipeline requires some additional preprocessing.
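For reference, the in-memory API is used roughly like this (a minimal sketch along the lines of the example's partition_graph.py; the dataset name and argument values are illustrative):

import dgl
from ogb.nodeproppred import DglNodePropPredDataset

# Load ogbn-products and partition it into 2 parts (one per machine).
data = DglNodePropPredDataset(name="ogbn-products")
g, labels = data[0]
g.ndata["labels"] = labels[:, 0]

dgl.distributed.partition_graph(
    g,
    graph_name="ogb-product",
    num_parts=2,                   # one partition per machine
    out_path="data2-ogb-product",  # directory referenced by --part_config above
    balance_edges=True,            # balance edge counts across partitions
)

Note that METIS partitioning itself is the memory-heavy step, which is why the whole graph has to fit in one machine's RAM with this API.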
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
Hi, I am closing this issue assuming you are happy about our response. Feel free to follow up and reopen the issue if you have more questions with regard to our response.