
Parameter Server Distributed RPC example is limited to only one worker.

Open panchul opened this issue 5 years ago • 12 comments

With the recent fix, the rpc/parameter_server example works as is in a one-master/one-worker configuration. However, if I change the world_size to 3 (or anything higher than 2), the example does not load-balance the batches: all three instances of rpc_parameter_server.py hang. Here are the steps to reproduce (instantiating one worker per machine):

On one machine I ran a master (rank 0) and a worker (rank 1). In one terminal:

server1:~/src/pytorch-examples/distributed/rpc/parameter_server$ python rpc_parameter_server.py --world_size=3 --rank=0 --master_addr=123.45.67.89 --master_port=29555 --num_gpus=0
PS master initializing RPC
RPC initialized! Running parameter server...
Using 0 GPUs to train
Putting first 2 convs on cpu
Putting rest of layers on cpu

In another:

server1:~/src/pytorch-examples/distributed/rpc/parameter_server$ python rpc_parameter_server.py --world_size=3 --rank=1 --master_addr=123.45.67.89 --master_port=29555 --num_gpus=0
Processing...
/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
Done!
Worker rank 1 initializing RPC
Worker 1 done initializing RPC

And on a separate machine the third worker (rank 2):

server2:~/src/pytorch-examples/distributed/rpc/parameter_server$ python rpc_parameter_server.py --world_size=3 --rank=2 --master_addr=123.45.67.89 --master_port=29555 --num_gpus=0
Processing...
/pytorch/torch/csrc/utils/tensor_numpy.cpp:141: UserWarning: The given NumPy array is not writeable, and PyTorch does not support non-writeable tensors. This means you can write to the underlying (supposedly non-writeable) NumPy array using the tensor. You may want to copy the array to protect its data or make it writeable before converting it to a tensor. This type of warning will be suppressed for the rest of this program.
Done!
Worker rank 2 initializing RPC
Worker 2 done initializing RPC

panchul avatar May 26 '20 17:05 panchul
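Before digging into the example itself, it can help to rule out a plain RPC rendezvous problem between the three machines. The snippet below is not part of the example; it is a minimal smoke test using only torch.distributed.rpc that one could run as three separate processes (passing 0, 1, 2 as the first argument and reusing the master address/port from the reproduction above) to confirm that all processes can initialize RPC and exchange messages. The rank-via-sys.argv convention is an assumption for this sketch.

import os
import sys
import torch.distributed.rpc as rpc

rank = int(sys.argv[1])          # 0, 1 or 2, one process per machine
world_size = 3
# Same rendezvous values as in the reproduction above.
os.environ.setdefault("MASTER_ADDR", "123.45.67.89")
os.environ.setdefault("MASTER_PORT", "29555")

# Default TensorPipe backend, env:// rendezvous.
rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
print(f"worker{rank}: RPC initialized")

if rank == 0:
    # Ask each peer to evaluate a trivial builtin, proving two-way connectivity.
    for peer in range(1, world_size):
        print(rpc.rpc_sync(f"worker{peer}", str, args=(peer,)))

rpc.shutdown()   # blocks until all three processes reach shutdown

If this already hangs, the problem is in the rendezvous (addresses, ports, firewalls) rather than in the parameter-server logic.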

@panchul I can only run the parameter server on two nodes. When using three nodes, an error arises.

jinalong avatar Jun 03 '20 06:06 jinalong

@jinalong , thanks for confirming.

panchul avatar Jun 03 '20 06:06 panchul

I get this error:

Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [64, 32, 3, 3]] is at version 3; expected version 2 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

@panchul, are you getting a similar error?

mayurvaid avatar Jun 07 '20 11:06 mayurvaid
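For context on what this error class means, independent of the RPC example: autograd records a version counter for every tensor it saves for the backward pass and raises this RuntimeError if the saved tensor is modified in place before backward gets to use it. A minimal single-process sketch that triggers the same message (illustrative only, not code from the example):

import torch

# w stands in for a parameter-server weight; x makes autograd save w,
# because computing x's gradient for x @ w requires w.
w = torch.randn(3, 3, requires_grad=True)
x = torch.randn(3, 3, requires_grad=True)

y = x @ w
loss = y.sum()

with torch.no_grad():
    w.add_(0.1)      # an "optimizer step" lands before backward has run

loss.backward()      # RuntimeError: one of the variables needed for gradient
                     # computation has been modified by an inplace operation

In the parameter-server setting the in-place update comes from another trainer's optimizer step racing with a still-pending backward pass, as explained further down in this thread.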

@mayurvaid I got a similar error.

jinalong avatar Jun 07 '20 13:06 jinalong

Ditto

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "rpc_parameter_server.py", line 224, in run_worker
    run_training_loop(rank, num_gpus, train_loader, test_loader)
  File "rpc_parameter_server.py", line 183, in run_training_loop
    dist_autograd.backward(cid, [loss])
RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CUDAFloatType [128, 10]], which is output 0 of TBackward, is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

ollien avatar Nov 04 '20 22:11 ollien

cc @rohan-varma

mrshenli avatar Nov 10 '20 23:11 mrshenli

I got a similar problem: RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [CPUFloatType [64, 32, 3, 3]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True). Have you solved it? @mayurvaid @ollien @panchul

Weigaa avatar Dec 14 '20 08:12 Weigaa

@osalpekar

Weigaa avatar Dec 14 '20 08:12 Weigaa

Hi,

The underlying cause of this issue is concurrent updates during the backward/optimizer-step portion, which we are currently debugging. Essentially, this error means that a weight has been updated by the optimizer from another node while the backward pass is still running on the PS. If you need to unblock, a (WIP) fix over at https://github.com/pytorch/examples/pull/842 effectively removes this issue by serializing the workers.

rohan-varma avatar Dec 14 '20 08:12 rohan-varma
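One way to unblock along the lines described above (serializing the trainers) is to guard each trainer's forward, distributed backward, and optimizer step with a lock that lives on the parameter server. The sketch below only illustrates that idea; it is not the code from pytorch/examples#842, and the names ("parameter_server", _run_one_step, the lock helpers) are assumptions for this sketch.

import threading
import torch.distributed.rpc as rpc
import torch.distributed.autograd as dist_autograd

# Module-level lock; only meaningful in the parameter-server process, where
# these two helpers are executed via RPC. A plain Lock may be released from a
# different RPC thread than the one that acquired it.
_update_lock = threading.Lock()

def _acquire_update_lock():
    _update_lock.acquire()

def _release_update_lock():
    _update_lock.release()

def _run_one_step(model, opt, loss_fn, data, targets):
    # Serialize forward + distributed backward + optimizer step across
    # trainers so the PS parameters are never updated while another trainer's
    # backward pass still needs the values saved during its forward pass.
    rpc.rpc_sync("parameter_server", _acquire_update_lock)
    try:
        with dist_autograd.context() as cid:
            loss = loss_fn(model(data), targets)
            dist_autograd.backward(cid, [loss])
            opt.step(cid)
    finally:
        rpc.rpc_sync("parameter_server", _release_update_lock)

Note that this removes the overlap between trainers entirely, which is exactly the trade-off of the serialization workaround.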

Same error. Environment information:

Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A
OS: Ubuntu 18.04.6 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.17
Python version: 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-3.10.0-1062.4.3.el7.x86_64-x86_64-with-debian-buster-sid
Is CUDA available: True
CUDA runtime version: 11.3.109
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe
Nvidia driver version: 515.65.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.21.5
[pip3] numpydoc==1.2
[pip3] torch==1.10.2+cu113
[pip3] torchtext==0.11.2
[pip3] torchvision==0.11.3+cu113
[conda] No relevant packages

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from data.wikitext2_data import WikiText2
import random
import argparse
import time
import math
from torch.nn import TransformerEncoder, TransformerEncoderLayer
import torch
import torch.distributed as dist
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp
import torch.optim as optim
from torch import nn
from torch.distributed.nn import RemoteModule
from torch.distributed.optim import DistributedOptimizer
from torch.distributed.rpc import RRef
from torch.distributed.rpc import TensorPipeRpcBackendOptions
from torch.nn.parallel import DistributedDataParallel as DDP

torch.autograd.set_detect_anomaly(True)

NUM_EMBEDDINGS = 100
EMBEDDING_DIM = 2
batch_size = 100
num_workers = 2

train_iter, val_iter, test_iter = WikiText2()

total_loss = 0.
train_iter, val_iter, test_iter = WikiText2(root='./data')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(
    map(tokenizer, train_iter), specials=[""])
vocab.set_default_index(vocab[""])
ntokens = len(vocab)  # the size of vocabulary
emsize = 4096  # embedding dimension
nhid = 4096  # the dimension of the feedforward network model in nn.TransformerEncoder
nlayers = 8  # the number of nn.TransformerEncoderLayer in nn.TransformerEncoder
nhead = 16  # the number of heads in the multiheadattention models
dropout = 0.2  # the dropout value

def data_process(raw_text_iter):
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long)
            for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(
            0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0).transpose(0, 1)
        self.register_parameter('pe', nn.Parameter(pe, requires_grad=False))

    def forward(self, x):
        x_ = x + self.pe[:x.size(0), :]
        return self.dropout(x_)

class Encoder(nn.Module):
    def __init__(self, ntoken, ninp, dropout=0.5):
        super(Encoder, self).__init__()
        self.pos_encoder = PositionalEncoding(ninp, dropout)
        self.encoder = nn.Embedding(ntoken, ninp)
        self.ninp = ninp
        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src):
        src_ = src.t()
        src__ = self.encoder(src_) * math.sqrt(self.ninp)
        return self.pos_encoder(src__).cpu()

class Decoder(nn.Module):
    def __init__(self, ntoken, ninp):
        super(Decoder, self).__init__()
        self.decoder = nn.Linear(ninp, ntoken)
        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, inp):
        # Need batch dimension first for output of pipeline.
        return self.decoder(inp).permute(1, 0, 2).contiguous().view(-1, ntokens)

class MID(nn.Module):
    def __init__(self, emsize, nhead, nhid, dropout, device):
        super(MID, self).__init__()
        tmp_list = []
        nlayers = 1
        self.emsize = emsize
        for i in range(nlayers):
            transformer_block = TransformerEncoderLayer(
                emsize, nhead, nhid, dropout)
            tmp_list.append(transformer_block)
        self.rnn0 = nn.Sequential(*tmp_list)
        self.device = device
        # self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, inp):
        # Need batch dimension first for output of pipeline.
        # if torch.cuda.is_available():
        if type(inp) in (tuple, list):
            # print("*"*100, inp[0].device, self.device)
            # device = inp[0].device
            inp = torch.cat([torch.unsqueeze(t, dim=0)
                            for t in inp]).cuda(self.device)  # .reshape(-1, self.emsize)
            # inp = inp.cuda(0)
        return self.rnn0(inp)  # .cpu()

class RNNModel(nn.Module):
    r"""
    A distributed RNN model which puts embedding table and decoder parameters
    on a remote parameter server, and locally holds parameters for the LSTM
    module. The structure of the RNN model is borrowed from the word language
    model example. See
    https://github.com/pytorch/examples/blob/main/word_language_model/model.py
    """

    def __init__(self, remote_emb_module, device='cpu'):
        super(RNNModel, self).__init__()
        # setup embedding table remotely
        self.remote_emb_module = remote_emb_module
        # setup LSTM locally
        self.decoder_rref = DDP(Decoder(ntokens, emsize).cuda(device),
                                device_ids=[device])
        self.device = device

    def forward(self, input):
        # pass input to the remote embedding table and fetch emb tensor back
        emb = self.remote_emb_module[0].forward(input)
        emb = self.remote_emb_module[1].forward(emb)
        # emb = self.remote_emb_module[2].forward(emb)
        if type(emb) in (tuple, list):
            emb = torch.cat([torch.unsqueeze(t, dim=0)
                            for t in emb])
        # print("#"*100, emb.device, self.device, self.decoder_rref.device)
        # print("#"*100, emb.device, self.device)
        return self.decoder_rref(emb.cuda(self.device))

class HybridModel(torch.nn.Module):
    r"""
    The model consists of a sparse part and a dense part.
    1) The dense part is an nn.Linear module that is replicated across all
       trainers using DistributedDataParallel.
    2) The sparse part is a Remote Module that holds an nn.EmbeddingBag on the
       parameter server. This remote model can get a Remote Reference to the
       embedding table on the parameter server.
    """

    def __init__(self, remote_emb_module, device):
        print(f"Init HybridModel on device {device}")
        super(HybridModel, self).__init__()
        self.remote_emb_module = remote_emb_module
        self.fc = DDP(torch.nn.Linear(EMBEDDING_DIM, 8).cuda(device),
                      device_ids=[device])
        self.device = device

    def forward(self, indices):
        print(self.remote_emb_module[0].device, indices[:, 0].device)
        print(self.remote_emb_module[1].device, indices[:, 1].device)

        emb_lookup1 = self.remote_emb_module[0].forward(indices[:, 0])
        emb_lookup2 = self.remote_emb_module[1].forward(indices[:, 1])
        if type(emb_lookup1) in (tuple, list):
            emb_lookup1 = torch.cat(
                emb_lookup1, dim=0).reshape(-1, EMBEDDING_DIM)
        if type(emb_lookup2) in (tuple, list):
            emb_lookup2 = torch.cat(
                emb_lookup2, dim=0).reshape(-1, EMBEDDING_DIM)

        # print(f"curr rank: {torch.distributed.get_rank()}, "
        #       f"self.device: {self.device}, ")

        return self.fc(
            emb_lookup1.cuda(self.device) + emb_lookup2.cuda(self.device))

def _run_trainer(remote_emb_module, rank):
    r"""
    Each trainer runs a forward pass which involves an embedding lookup on the
    parameter server and running nn.Linear locally. During the backward pass,
    DDP is responsible for aggregating the gradients for the dense part
    (nn.Linear) and distributed autograd ensures gradients updates are
    propagated to the parameter server.
    """

    # model = RNNModel( emsize, ntokens, nhid, nlayers)
    print(f"Setup the model on {rank}")
    model = RNNModel(remote_emb_module=remote_emb_module, device=rank)

    model_parameter_rrefs = []
    for rm in model.remote_emb_module:
        model_parameter_rrefs += rm.remote_parameters()

    for param in model.decoder_rref.parameters():
        model_parameter_rrefs.append(RRef(param))
    device = torch.device("cpu")

    def batchify(data, bsz):
        # Divide the dataset into bsz parts.
        nbatch = data.size(0) // bsz
        # Trim off any extra elements that wouldn't cleanly fit (remainders).
        data = data.narrow(0, 0, nbatch * bsz)
        # Evenly divide the data across the bsz batches.
        data = data.view(bsz, -1).t().contiguous()
        return data.to(device)

    batch_size = 20
    eval_batch_size = 10
    global train_data, val_data, test_data
    train_data = batchify(train_data, batch_size)
    val_data = batchify(val_data, eval_batch_size)
    test_data = batchify(test_data, eval_batch_size)
    bptt = 35

    def get_batch(source, i):
        seq_len = min(bptt, len(source) - 1 - i)
        data = source[i:i+seq_len]
        target = source[i+1:i+1+seq_len].view(-1)
        # Need batch dimension first for pipeline parallelism.
        return data.t(), target

    # setup distributed optimizer
    opt = DistributedOptimizer(
        optim.SGD,
        model_parameter_rrefs,
        lr=0.1
    )

    criterion = torch.nn.CrossEntropyLoss()

    # Train only for 50 batches to keep script execution time low.
    nbatches = min(50 * bptt, train_data.size(0) - 1)
    max_epochs = 3

    def train():
        total_loss = 0.0
        for batch, i in enumerate(range(0, nbatches, bptt)):
            data, targets = get_batch(train_data, i)
            targets = targets.to(rank)
            with dist_autograd.context() as context_id:
                # optimizer.zero_grad()
                # Since the Pipe is only within a single host and process the ``RRef``
                # returned by forward method is local to this node and can simply
                # retrieved via ``RRef.local_value()``.
                output = model(data)
                # Need to move targets to the device where the output of the
                # pipeline resides.
                # print('*'*100)
                # print(output.shape,ntokens,targets.shape)

                loss = criterion(
                    output, targets)

                dist_autograd.backward(context_id, [loss])
                # torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
                opt.step(context_id)
                # print(loss)
                total_loss += loss.item()
                log_interval = 10
                if batch % log_interval == 0 and batch > 0:
                    cur_loss = total_loss / log_interval
                    elapsed = time.time() - start_time
                    print('| epoch {:3d} | {:5d}/{:5d} batches | '
                          'lr {:02.2f} | ms/batch {:5.2f} | '
                          'loss {:5.2f} | ppl {:8.2f}'.format(
                              epoch, batch, nbatches // bptt, 5,
                              elapsed * 1000 / log_interval,
                              cur_loss, math.exp(cur_loss)))
                    total_loss = 0
                start_time = time.time()

    for epoch in range(1, max_epochs + 1):
        epoch_start_time = time.time()
        train()
        # val_loss = evaluate(model, val_data)
        print('-' * 89)
        # print('| end of epoch {:3d} | time: {:5.2f}s | valid loss {:5.2f} | '
        #       'valid ppl {:8.2f}'.format(epoch, (time.time() - epoch_start_time),
        #                                  val_loss, math.exp(val_loss)))
        print('-' * 89)

def run_worker(rank, world_size):
    r"""
    A wrapper function that initializes RPC, calls the function, and shuts
    down RPC.
    """

    rpc_backend_options = TensorPipeRpcBackendOptions()
    rpc_backend_options.init_method = "tcp://localhost:29501"
    rpc_backend_options.rpc_timeout = 60000*5
    if rank == num_workers:
        rpc.init_rpc(
            "master",
            rank=rank,
            world_size=world_size,
            rpc_backend_options=rpc_backend_options,
        )
        remote_emb_module = [
            RemoteModule(
                "trainer0/cuda:0",
                Encoder,
                args=(ntokens, emsize, dropout),
                kwargs={},
            ),
            RemoteModule(
                "trainer1/cuda:1",
                MID,
                args=(emsize, nhead, nhid, dropout, 1),
                kwargs={},
            ),
            # RemoteModule(
            #     "trainer2/cuda:2",
            #     MID,
            #     args=(emsize, nhead, nhid, dropout, 2),
            #     kwargs={},
            # ),
        ]

        # Run the training loop on trainers.
        futs = []
        for trainer_rank in range(num_workers):
            trainer_name = "trainer{}".format(trainer_rank)
            fut = rpc.rpc_async(trainer_name,
                                _run_trainer,
                                args=(remote_emb_module, trainer_rank))
            futs.append(fut)

        # Wait for all training to finish.
        for fut in futs:
            fut.wait()
    elif rank < num_workers:
        # Initialize process group for Distributed DataParallel on trainers.
        dist.init_process_group(backend="nccl",
                                rank=rank,
                                world_size=num_workers,
                                init_method="tcp://localhost:29500")

        # Initialize RPC.
        trainer_name = "trainer{}".format(rank)
        worker_rpc_backend_options = TensorPipeRpcBackendOptions()
        worker_rpc_backend_options.init_method = "tcp://localhost:29501"
        worker_rpc_backend_options.rpc_timeout = 60000*5
        for remote_rank in range(num_workers):
            if remote_rank != rank:
                worker_rpc_backend_options.set_device_map(f"trainer{remote_rank}",
                                                          {rank: remote_rank})
        rpc.init_rpc(
            trainer_name,
            rank=rank,
            world_size=world_size,
            rpc_backend_options=worker_rpc_backend_options,
        )

    # Trainer just waits for RPCs from master.

    # block until all rpcs finish
    rpc.shutdown()

def main(args):
    run_worker(int(args.rank), int(args.world_size))

if __name__ == "__main__":
    # world_size = 1 + num_workers
    # mp.spawn(run_worker, args=(world_size,), nprocs=world_size, join=True)
    argparser = argparse.ArgumentParser("training")
    argparser.add_argument('--rank', default='0', type=str)
    argparser.add_argument('--world_size', default=3, type=int)

    args = argparser.parse_args()
    main(args)

Reference: pytorch/pytorch#60440 (https://github.com/pytorch/pytorch/issues/60440)

allendred avatar Nov 24 '22 14:11 allendred

Hi, I'm facing the same issue with the rpc_parameter_server.py example. Has this issue been resolved? I set --world_size=3 for 3 nodes of isolated Docker containers and it shows the following error messages.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/fed/rpc_parameter_server.py", line 231, in run_worker
    run_training_loop(rank, num_gpus, train_loader, test_loader)
  File "/home/fed/rpc_parameter_server.py", line 190, in run_training_loop
    dist_autograd.backward(cid, [loss])
RuntimeError: Error on Node 0: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 1, 3, 3]] is at version 6; expected version 5 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Exception raised from unpack at /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/autograd/saved_variable.cpp:184 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f96e916c457 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f96e91363ec in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: torch::autograd::SavedVariable::unpack(std::shared_ptr<torch::autograd::Node>) const + 0x7fc (0x7f971f969f2c in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: torch::autograd::generated::ConvolutionBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0xfd (0x7f971ee154fd in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: <unknown function> + 0x43a2f8b (0x7f971f93df8b in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&) + 0x1638 (0x7f971f937878 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: torch::distributed::autograd::DistEngine::execute_graph_task_until_ready_queue_empty(torch::autograd::NodeTask&&, bool) + 0x434 (0x7f9720374a64 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: <unknown function> + 0x4dda536 (0x7f9720375536 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x14a8e51 (0x7f971ca43e51 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #9: c10::ThreadPool::main_loop(unsigned long) + 0x285 (0x7f96e915dd85 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #10: <unknown function> + 0xdbbf4 (0x7f9731a4dbf4 in /opt/conda/lib/python3.10/site-packages/torch/lib/../../../../libstdc++.so.6)
frame #11: <unknown function> + 0x76db (0x7f97678cb6db in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #12: clone + 0x3f (0x7f9766e4f61f in /lib/x86_64-linux-gnu/libc.so.6

HeywardLiu avatar Mar 10 '23 19:03 HeywardLiu

> Hi, I'm facing the same issue with the rpc_parameter_server.py example. Has this issue been resolved? I set --world_size=3 for 3 nodes of isolated Docker containers and it shows the following error messages. [...] (same traceback as in the previous comment)

This problem cannot be solved as-is; the only option is to manually place the model layers on different GPUs to achieve mixed parallelism.

allendred avatar Mar 22 '23 06:03 allendred
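To illustrate what "manually place the model layers on different GPUs" in the comment above refers to, here is a toy, self-contained sketch of manual layer placement (assuming a machine with at least two GPUs); it is not the model from the RPC example, just the general technique.

import torch
import torch.nn as nn

class TwoGPUNet(nn.Module):
    """Toy model-parallel module: the first block lives on cuda:0,
    the second on cuda:1, and activations are moved by hand."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Linear(4096, 10).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        # Move the activations to the second card before the final layer.
        return self.part2(x.to("cuda:1"))

model = TwoGPUNet()
out = model(torch.randn(8, 1024))
print(out.device)   # cuda:1

Data parallelism (e.g. DDP) can then be layered on top of such a manually partitioned model, which is what "mixed parallelism" refers to here.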