CCD
RuntimeError: DataLoader worker (pid 32978) is killed by signal: Aborted.
Hi, I am unable to run CCD; memory usage seems to spike immediately. I have reduced num_workers to 0, reduced the batch size to 8, and enabled fp16. The DataLoader seems to cause problems only when running train.py. I was able to run test.py on the ARD model.
Can you describe your problem in detail, including hardware configuration and reported problem?
OS: Ubuntu 22.04.4 LTS x86_64
GPU: NVIDIA GeForce GTX 1050 Ti Mobile (CUDA: 12.2)
RAM: 16 GB
I followed the installation instructions with torch==1.10.0+cu113 and other similar dependencies. Running inference with the ARD model on my dataset works fine, but when I try to train a model with train.py, I get an error saying that the DataLoader worker processes were killed by a signal.
This is my CCD_pretrain_ViT_Base.yaml:
global:
  name: pre_base_65536
  phase: train
  stage: pretrain-vision
  workdir: workdir
  seed: ~

output_dir: './saved_models/'

dataset:
  scheme: selfsupervised_kmeans
  type: ST
  train: {
    roots: [
        '/home/aakash01/Desktop/parseq/results/train/real',
        # 'xxx/data_lmdb/training/label/Synth',
        # 'xxx/data_lmdb/training/URD/OCR-CC',
    ],
  }
  valid: {
    roots: [
        '/home/aakash01/Desktop/parseq/results/val',
        # 'xxx/data_lmdb/validation',
    ],
  }
  test: {
    roots: [
        '/home/aakash01/Desktop/parseq/results/test',
        # 'xxx/data_lmdb/evaluation/benchmark',
        # 'xxx/data_lmdb/evaluation/addition',
    ],
  }
  data_aug: True
  multiscales: False
  mask: False
  num_workers: 8
  augmentation_severity: 5
  charset_path: './Dino/data/charset_95.txt'
  mask_path: None #'xxx/data_lmdb/Mask'

training:
  epochs: 3
  start_iters: 0
  show_iters: 200
  eval_iters: 3000
  save_iters: 50000

model:
  name: 'Dino.model.dino_vision.ABIDINOModel'
  seg_channel: 512
  checkpoint: ~

mp:
  num: 1

arch: 'vit_base'
patch_size: 4
out_dim: 65536
#Not normalizing leads to better performance but can make the training unstable.
#In our experiments, we typically set this paramater to False with vit_small and True with vit_base."""
norm_last_layer: True
#We recommend setting a higher value with small batches: for example use 0.9995 with batch size of 256.
momentum_teacher: 0.9995
#Initial value for the teacher temperature: 0.04 works well in most cases.
#Try decreasing it if the training loss does not decrease.
warmup_teacher_temp: 0.04
#We recommend starting with the default value of 0.04 and increase this slightly if needed.
teacher_temp: 0.04
#Number of warmup epochs for the teacher temperature (Default: 30).
warmup_teacher_temp_epochs: 0
batch_size_per_gpu: 8
#The learning rate is linearly scaled with the batch size, and specified here for a reference batch size of 256.
lr: 0.0005
#Clipping with norm .3 ~ 1.0 can help optimization for larger ViT architectures.
clip_grad: 3.0
use_bn_in_head: False
use_fp16: False
weight_decay: 0.04
weight_decay_end: 0.4
epochs: 100
freeze_last_layer: 1
warmup_epochs: 10
min_lr: 0.000001
optimizer: adamw
drop_path_rate: 0.1
global_crops_scale: (0.4, 1.)
local_crops_number: 8
crops_number: 2
local_crops_scale: (0.05, 0.4)
seed: 0
num_workers: 8
dist_url: "env://"
local_rank: 0
saveckp_freq: 10
warmup_epoch: 10
imgnet_based: 1000000
I have disabled masks, and num_workers is set to 8. The error message I receive is:
Fatal Python error: Cannot recover from stack overflow.
Current thread 0x00007fc3770c5740 (most recent call first):
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/_util.py", line 6 in is_path
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/ImageFile.py", line 103 in __init__
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/JpegImagePlugin.py", line 822 in jpeg_factory
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/Image.py", line 3263 in _open_core
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/PIL/Image.py", line 3277 in open
File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 143 in get
File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 166 in get
File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 92 in _next_image
...
Traceback (most recent call last):
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/queue.py", line 179, in get
self.not_empty.wait(remaining)
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/threading.py", line 300, in wait
gotit = waiter.acquire(True, timeout)
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 21829) is killed by signal: Aborted.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "train.py", line 457, in <module>
train(config)
File "train.py", line 187, in train
for (image_tensors, masks, metrics) in metric_logger.log_every(train_dataloader, 10, header):
File "/home/aakash01/Desktop/CCD/Dino/modules/utils.py", line 388, in log_every
for obj in iterable:
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1142, in _get_data
success, data = self._try_get_data()
File "/home/aakash01/anaconda3/envs/CCD/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 21829) exited unexpectedly
I have read about this error, and it seems to be related to extremely high memory usage. I am not sure why this is happening, though, because I have used LMDB datasets with DataLoaders in the PARSeq model. I have tried reducing the batch size to 1 and setting num_workers to 8, but the error persists.
Also, can you explain what the mp flag in the config file means? I first want to try training the model on my local machine and then submit a batch job on the HPC cluster.
File "/home/aakash01/Desktop/CCD/Dino/dataset/dataset.py", line 143 in get
I read the issue.
I think you should set a breakpoint at line 166 of /home/aakash01/Desktop/CCD/Dino/dataset/dataset.py to check whether any errors occur while reading the data.
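Besides a breakpoint, a quick scan of the LMDB keys can show whether the records the loader expects are present at all. The sketch below is only an illustration: find_missing_keys is a hypothetical helper, not part of CCD, and it assumes the image-{idx + 1:09d} / mask-{idx + 1:09d} key layout seen in dataset.py. It takes any key lookup function, so with a real dataset you would pass txn.get from env.begin(write=False); here plain dicts stand in for the LMDB transactions.

```python
from typing import Callable, List, Optional


def find_missing_keys(get: Callable[[bytes], Optional[bytes]],
                      num_samples: int,
                      prefix: str = "image") -> List[str]:
    """Return the keys for which the store has no value.

    `get` maps a key to bytes or None; with LMDB this would be `txn.get`.
    """
    missing = []
    for idx in range(num_samples):
        key = f"{prefix}-{idx + 1:09d}"  # key layout assumed from dataset.py
        if get(key.encode()) is None:
            missing.append(key)
    return missing


# Stand-ins for LMDB transactions: sample 2 has an image but no mask.
fake_images = {b"image-000000001": b"...", b"image-000000002": b"..."}
fake_masks = {b"mask-000000001": b"..."}

missing_images = find_missing_keys(fake_images.get, 2, "image")
missing_masks = find_missing_keys(fake_masks.get, 2, "mask")
print(missing_images)
print(missing_masks)
```

If every key is reported missing, the problem is likely in how the dataset was built or which path is configured, rather than in individual corrupted images.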
mp is a multi-process marker, which was later deprecated. You should ignore it.
I ran into the same problem, and we use almost the same YAML file. Have you solved this?
The issue comes from https://github.com/TongkunGuan/CCD/blob/0b6bdf9415a0d33d7e7a9adac21d9036d915709d/Dino/dataset/dataset.py#L133
def get(self, idx):
    with self.env.begin(write=False) as txn:
        image_key, label_key = f'image-{idx + 1:09d}', f'label-{idx + 1:09d}'
        try:
            imgbuf = txn.get(image_key.encode())  # image
            buf = six.BytesIO()
            buf.write(imgbuf)
            buf.seek(0)
            with warnings.catch_warnings():
                warnings.simplefilter("ignore", UserWarning)  # EXIF warning from TiffPlugin
                image = PIL.Image.open(buf).convert(self.convert_mode)
            with self.mask_env.begin(write=False) as mask_txn:
                mask_key = f'mask-{idx + 1:09d}'
                try:
                    maskbuf = mask_txn.get(mask_key.encode())  # image
                    mask_buf = six.BytesIO()
                    mask_buf.write(maskbuf)
                    mask_buf.seek(0)
                    mask = PIL.Image.open(mask_buf).convert('L')
                except:
                    print(f"Corrupted image for {idx}")
                    mask = np.zeros((self.img_w, self.img_h))
            if self.is_training and not self._check_image(image):
                return self._next_image()
        except:
            return self._next_image()
        return image, mask, idx
You should check that the data is read correctly before self._next_image() is called. Did you add the mask_env file?
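The stack trace alternates between get and _next_image, which suggests that every retry fails in the same way and the bare except hides the reason. One way to surface the real error is to bound the retries and re-raise the last exception. The sketch below is a toy illustration, not the CCD implementation: ToyDataset, _load, and max_retries are hypothetical names, the "next index" retry policy is an assumption, and the missing mask_env attribute is only one possible cause that has not been confirmed in this thread.

```python
class ToyDataset:
    """Toy stand-in for the LMDB dataset; every read fails on purpose."""

    def __init__(self, length, max_retries=5):
        self.length = length
        self.max_retries = max_retries

    def _load(self, idx):
        # Hypothetical loader that mimics a systematic failure, e.g. an
        # attribute that was never created because masks are disabled.
        raise AttributeError("'ToyDataset' object has no attribute 'mask_env'")

    def get(self, idx):
        # Retry a limited number of times instead of recursing without a
        # bound, then raise with the last real error attached as the cause.
        last_error = None
        for _ in range(self.max_retries):
            try:
                return self._load(idx)
            except Exception as err:
                last_error = err
                idx = (idx + 1) % self.length  # assumed policy: try next sample
        raise RuntimeError(
            f"no readable sample after {self.max_retries} attempts"
        ) from last_error


ds = ToyDataset(length=100)
caught = None
try:
    ds.get(0)
except RuntimeError as err:
    caught = err
    print(f"{err} (cause: {err.__cause__!r})")
```

With a bounded loop, a dataset that cannot be read at all fails quickly with the underlying exception visible, instead of exhausting the stack inside a DataLoader worker.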