
RuntimeError: DataLoader worker (pid 1620) is killed by signal: Killed.

Kracozebr opened this issue 2 years ago · 3 comments

I'm trying to train YOLACT in Google Colab on my custom dataset and I get the following error:

yield from torch.randperm(n, generator=generator).tolist()
RuntimeError: Expected a 'cuda' device type for generator but found 'cpu'

I changed the following code in train.py:

data_loader = data.DataLoader(dataset, args.batch_size,
                                 num_workers=args.num_workers,
                                 shuffle=True, collate_fn=detection_collate,
                                 pin_memory=True)

To:

data_loader = data.DataLoader(dataset, args.batch_size,
                                 num_workers=args.num_workers,
                                 shuffle=True, collate_fn=detection_collate,
                                 generator=torch.Generator(device='cuda'),
                                 pin_memory=True)

But, unfortunately, I got the following error:

loading annotations into memory...
Done (t=0.34s)
creating index...
index created!
loading annotations into memory...
Done (t=0.14s)
creating index...
index created!
/usr/local/lib/python3.7/dist-packages/torch/jit/_recursive.py:222: UserWarning: 'lat_layers' was found in ScriptModule constants,  but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
/usr/local/lib/python3.7/dist-packages/torch/jit/_recursive.py:222: UserWarning: 'pred_layers' was found in ScriptModule constants,  but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
/usr/local/lib/python3.7/dist-packages/torch/jit/_recursive.py:222: UserWarning: 'downsample_layers' was found in ScriptModule constants,  but it is a non-constant submodule. Consider removing it.
  " but it is a non-constant {}. Consider removing it.".format(name, hint))
Initializing weights...
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
Begin training!

/content/yolact/utils/augmentations.py:309: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  mode = random.choice(self.sample_options)
/content/yolact/utils/augmentations.py:309: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  mode = random.choice(self.sample_options)
/content/yolact/utils/augmentations.py:309: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  mode = random.choice(self.sample_options)
/content/yolact/utils/augmentations.py:309: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
  mode = random.choice(self.sample_options)
tcmalloc: large alloc 7694368768 bytes == 0x5607bccf6000 @  0x7f4d4a805001 0x7f4cee4c654f 0x7f4cee516b58 0x7f4cee51ab17 0x7f4cee5b9203 0x56071b0f80a4 0x56071b0f7da0 0x56071b16c868 0x56071b0f9b99 0x56071b13ce79 0x56071b0f87b2 0x56071b16be65 0x56071b166c35 0x56071b0f9dd1 0x56071b13ce79 0x56071b0f87b2 0x56071b16c6f2 0x56071b0f9b99 0x56071b13ce79 0x56071b0f87b2 0x56071b16c6f2 0x56071b167235 0x56071b0f973a 0x56071b167d67 0x56071b0fb6db 0x56071b13c439 0x56071b13c3ac 0x56071b1e0119 0x56071b16807e 0x56071b166dcc 0x56071b0f973a
[  0]       0 || B: 6.483 | C: 13.374 | M: 6.231 | S: 3.235 | T: 29.322 || ETA: 33 days, 8:00:06 || timer: 36.000
Traceback (most recent call last):
  File "train.py", line 505, in <module>
    train()
  File "train.py", line 308, in train
    losses = net(datum)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "train.py", line 146, in forward
    losses = self.criterion(self.net, preds, targets, masks, num_crowds)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/yolact/layers/modules/multibox_loss.py", line 159, in forward
    ret = self.lincomb_mask_loss(pos, idx_t, loc_data, mask_data, priors, proto_data, masks, gt_box_t, score_data, inst_data, labels)
  File "/content/yolact/layers/modules/multibox_loss.py", line 546, in lincomb_mask_loss
    pos_idx_t = idx_t[idx, cur_pos]
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 1620) is killed by signal: Killed. 

Any suggestions?

Kracozebr · Sep 07 '21
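For context, the tcmalloc line right before the crash shows a single allocation of roughly 7.7 GB, and a DataLoader worker being "killed by signal: Killed" on Colab typically means the host ran out of RAM and the kernel's OOM killer terminated the worker process. Below is a minimal sketch of a lower-memory DataLoader call, based on the call quoted above; the min() clamp to 2 workers is illustrative and taken from Colab's own warning in the log:

data_loader = data.DataLoader(dataset, args.batch_size,
                              # clamp workers to what Colab suggests (2) so fewer
                              # copies of the data pipeline are held in host RAM
                              num_workers=min(args.num_workers, 2),
                              shuffle=True, collate_fn=detection_collate,
                              generator=torch.Generator(device='cuda'),
                              pin_memory=True)

Passing a smaller --batch_size when launching train.py should reduce peak memory further.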

Change num_workers=0 in train.py. For me it worked!

denashamss · Nov 23 '21
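Concretely, that suggestion amounts to the change below, sketched against the DataLoader call quoted in the original post. With num_workers=0 the batches are loaded in the main process, so there is no worker subprocess for the OOM killer to terminate, at the cost of slower data loading:

data_loader = data.DataLoader(dataset, args.batch_size,
                              num_workers=0,  # was args.num_workers; load in the main process
                              shuffle=True, collate_fn=detection_collate,
                              generator=torch.Generator(device='cuda'),
                              pin_memory=True)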

> Change num_workers=0 in train.py. For me it worked!

Thanks, it works.

aodeluo · Feb 08 '22

> Change num_workers=0 in train.py. For me it worked!

I followed these steps and now I get this error; can you help me?

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

letessarini · Dec 05 '23
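That TypeError is generic PyTorch behaviour rather than something specific to yolact: NumPy can only read host memory, so a tensor that lives on the GPU has to be copied back with .cpu() (and detached from the autograd graph if it requires grad) before calling .numpy(). A minimal sketch of the pattern follows; the tensor name is illustrative, and the actual offending line will appear in your traceback:

import torch

t = torch.randn(3, device='cuda')  # some CUDA tensor produced during training (assumes a GPU is available)
# arr = t.numpy()                  # raises: can't convert cuda:0 device type tensor to numpy
arr = t.detach().cpu().numpy()     # copy to host memory first, then convert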