AttributeError: 'list' object has no attribute 'local_scope'
🐛 Bug
When I run dgl\examples\pytorch\graphsage\dist\train_dist.py on GPUs following the instructions in README.md, it works fine, but after changing the model's network layers the following error occurs:
AttributeError: 'list' object has no attribute 'local_scope'
To Reproduce
Steps to reproduce the behavior:
- The model trains fine when running the following command. The code in the workspace is copied from dgl\examples\pytorch\graphsage\dist\ .
/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 train_dist.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 1000 --num_gpus 1 --backend nccl"
- Change the network layers in dgl\examples\pytorch\graphsage\dist\train_dist.py as follows
# GAT
class DistGAT(nn.Module):
    def __init__(
        self, in_feats, n_hidden, n_classes, heads
        # n_layers, activation, dropout
    ):
        super().__init__()
        self.gat_layers = nn.ModuleList()
        # two-layer GAT
        self.gat_layers.append(
            dglnn.GATConv(
                in_feats,
                n_hidden,
                heads[0],
                feat_drop=0.6,
                attn_drop=0.6,
                activation=F.elu,
            )
        )
        self.gat_layers.append(
            dglnn.GATConv(
                n_hidden * heads[0],  # the first layer outputs n_hidden * heads[0] features
                n_classes,
                heads[1],
                feat_drop=0.6,
                attn_drop=0.6,
                activation=None,
            )
        )

    def forward(self, g, inputs):
        h = inputs
        for i, layer in enumerate(self.gat_layers):
            h = layer(g, h)
            if i == 1:  # last layer: average the attention heads
                h = h.mean(1)
            else:  # other layer(s): concatenate the attention heads
                h = h.flatten(1)
        return h
def run(args, device, data):
    ...
    # Define model and optimizer
    model = DistGAT(
        in_feats,
        args.num_hidden,
        n_classes,
        heads=[8, 1],
    )
    # args.num_layers,
    # F.relu,
    # args.dropout,
    # )
    ...
Then execute:
/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 gat-2-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"
The cluster starts as expected, and then the following error occurs:
Traceback (most recent call last):
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
main(args)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
run(args, device, data)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
batch_pred = model(blocks, batch_inputs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
h = layer(g, h)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...
(a second client prints an identical traceback before exiting)
- GCN is arguably even closer to GraphSAGE; making the analogous change produces the same error.
# GCN
class DistGCN(nn.Module):
    def __init__(self, in_size, hid_size, out_size):
        super().__init__()
        self.layers = nn.ModuleList()
        # two-layer GCN
        self.layers.append(
            dglnn.GraphConv(in_size, hid_size, activation=F.relu)
        )
        self.layers.append(dglnn.GraphConv(hid_size, out_size))
        self.dropout = nn.Dropout(0.5)

    def forward(self, g, features):
        h = features
        for i, layer in enumerate(self.layers):
            if i != 0:
                h = self.dropout(h)
            h = layer(g, h)
        return h
def run(args, device, data):
    ...
    # Define model and optimizer
    # model = GCN(
    #     in_feats,
    #     args.num_hidden,
    #     n_classes,
    #     args.num_layers,
    #     F.relu,
    #     args.dropout,
    # )
    model = DistGCN(in_feats, 16, n_classes).to(device)
    ...
Then execute:
/home/tyc/anaconda3/envs/gnn/bin/python3 ~/workspace/graphsage/launch.py \
--workspace ~/workspace/graphsage/ \
--num_trainers 1 \
--num_samplers 0 \
--num_servers 1 \
--part_config data2-ogb-product/ogb-product.json \
--ip_config ip_config.txt \
"/home/tyc/anaconda3/envs/gnn/bin/python3 gcn-dist-change_model.py --graph_name ogb-product --ip_config ip_config.txt --num_epochs 3 --batch_size 10000 --num_gpus 1 --backend nccl"
The resulting output is:
Traceback (most recent call last):
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 413, in <module>
main(args)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 358, in main
run(args, device, data)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 215, in run
batch_pred = model(blocks, batch_inputs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/workspace/graphsage/gcn-dist-change_model.py", line 45, in forward
h = layer(g, h)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tyc/anaconda3/envs/gnn/lib/python3.9/site-packages/dgl/nn/pytorch/conv/graphconv.py", line 405, in forward
with graph.local_scope():
AttributeError: 'list' object has no attribute 'local_scope'
Client[3] in group[0] is exiting...
(a second client prints an identical traceback before exiting)
Client[0] in group[0] is exiting...
Expected behavior
Distributed training should work for other models as well, e.g. GAT, GCN, GIN, etc.
Environment
- DGL Version (e.g., 1.0): DGL 2.1.0
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.1.0
- OS (e.g., Linux): Ubuntu 20.04
- How you installed DGL (conda, pip, source): conda
- Build command you used (if compiling from source):
- Python version: Python 3.9.18
- CUDA/cuDNN version (if applicable): cuda_12.1.0_530.30.02_linux
- GPU models and configuration (e.g. V100): one machine has a GeForce RTX 2060 SUPER, the other a GeForce GTX 1660 SUPER.
- Any other relevant information: I train the above models on a local cluster of two computers (each with a different graphics card) and have not migrated to the cloud yet.
Additional context
After reviewing the documentation on docs.dgl.ai, I am still unclear on how to resolve the following error:
AttributeError: 'list' object has no attribute 'local_scope'
The code in the dgl/examples/pytorch/graphsage/dist directory is quite enlightening, and I am interested in extending it to additional models. Any guidance you could offer would be greatly appreciated.
The command that runs the training has a few more parameters and paths than the command in README.md because the following problems occur:
- Probably because I installed DGL in a conda virtual environment: if I don't use the full path to python3, I get
ModuleNotFoundError: No module named 'numpy'
or
ModuleNotFoundError: No module named 'dgl'
- If I use the default --backend parameter, gloo, it fails with
[E ProcessGroupGloo.cpp:138] Gloo connectFullMesh failed with [/opt/conda/conda-bld/pytorch_1695392035629/work/third_party/gloo/gloo/transport/tcp/pair.cc:144] no error
and I have no idea how to solve this (see the note after this list).
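For what it's worth, Gloo's connectFullMesh often fails when it auto-selects the wrong network interface on machines with several NICs; pinning the interface explicitly on every machine before launching may be worth a try (purely a guess, and the interface name below is illustrative):
export GLOO_SOCKET_IFNAME=eno1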
Once again, thank you for your exceptional work!
@Rhett-Ying do we have DistDGL examples?
Please refer to the non-dist versions of the GAT/GCN models, such as https://github.com/dmlc/dgl/tree/master/examples/pytorch/gat, to make sure the model is runnable. The model code should be the same in DistDGL and non-dist training.
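For reference, the traceback shows why the copied model code breaks: train_dist.py calls model(blocks, batch_inputs), where blocks is a list of DGLBlock objects (one sampled block per layer), while the forward above passes that whole list to every conv layer, hence AttributeError: 'list' object has no attribute 'local_scope'. Below is a minimal, untested sketch of a block-based forward for the two-layer GAT above, mirroring how the example's DistSAGE forward zips layers with blocks:

def forward(self, blocks, inputs):
    h = inputs
    for i, (layer, block) in enumerate(zip(self.gat_layers, blocks)):
        # Each GATConv consumes one sampled block, not the whole list.
        h = layer(block, h)
        if i == len(self.gat_layers) - 1:  # last layer: average the heads
            h = h.mean(1)
        else:  # hidden layer(s): concatenate the heads
            h = h.flatten(1)
    return h

The same change applies to the DistGCN forward (one GraphConv call per block).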
A better option for running various models with distributed training/inference is GraphStorm, which offers high-level APIs.
Thanks for your advice. Since the "Gloo connectFullMesh failed with..." error has not been resolved, I am trying to train some models from https://github.com/dmlc/dgl/tree/master/examples/pytorch/ on two machines.
Also, I would like to ask about dataset partitioning. When partitioning the dataset with https://github.com/dmlc/dgl/tree/master/examples/pytorch/graphsage/dist/partition_graph.py, the memory required is several times the size of the dataset. Are there any corresponding memory optimisations, or are other tools provided?
Are there any corresponding optimisations for memory, or are other tools provided?
Unfortunately, there's not much optimization available for the partition stage. dgl.distributed.partition_graph() is the most convenient API available for now. But we also support partitioning a graph with a distributed pipeline if you have multiple machines with small CPU RAM; please refer to here for more details. This partition pipeline requires some additional preprocessing.
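For reference, the in-memory API is used roughly like this (a minimal sketch along the lines of the example's partition_graph.py; the dataset name and argument values are illustrative):

import dgl
from ogb.nodeproppred import DglNodePropPredDataset

# Load ogbn-products and partition it into 2 parts (one per machine).
data = DglNodePropPredDataset(name="ogbn-products")
g, labels = data[0]
g.ndata["labels"] = labels[:, 0]

dgl.distributed.partition_graph(
    g,
    graph_name="ogb-product",
    num_parts=2,                   # one partition per machine
    out_path="data2-ogb-product",  # directory referenced by --part_config above
    balance_edges=True,            # balance edge counts across partitions
)

Note that METIS partitioning itself is the memory-heavy step, which is why the whole graph has to fit in one machine's RAM with this API.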
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
Hi, I am closing this issue assuming you are happy about our response. Feel free to follow up and reopen the issue if you have more questions with regard to our response.