deep-person-reid

Error while extracting features from gallery set

Open jungmin-lim opened this issue 3 years ago • 3 comments

I encountered the error below while using a custom dataset. The image filenames in the dataset look like this: "path/to/dataset/IN_HPID_SN4_CAMID_11503.png". The code I used is shown below; it worked fine with the built-in datasets, and the datamanager loaded the custom dataset successfully when it was declared. Is there something I might have missed?

epoch: [10/150][570/582] time 0.159 (0.197) data 0.000 (0.006) eta 4:27:21 loss 1.7876 (1.7792) acc 87.8906 (87.9955) lr 0.001500
epoch: [10/150][580/582] time 0.167 (0.197) data 0.000 (0.006) eta 4:27:27 loss 1.8199 (1.7797) acc 87.8906 (87.9815) lr 0.001500

Evaluating koreanreidimage (source)

Extracting features from query set ...
Done, obtained 119361-by-512 matrix
Extracting features from gallery set ...
Traceback (most recent call last):
  File "/home/user01/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/local/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/usr/local/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/user01/.local/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 17465) is killed by signal: Killed.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./koreanreidimage.py", line 118, in <module>
    open_layers=['classifier']
  File "/home/user01/python/ljm/install/deep-person-reid-master/torchreid/engine/engine.py", line 207, in run
    ranks=ranks
  File "/home/user01/python/ljm/install/deep-person-reid-master/torchreid/engine/engine.py", line 335, in test
    rerank=rerank
  File "/home/user01/.local/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/user01/python/ljm/install/deep-person-reid-master/torchreid/engine/engine.py", line 384, in _evaluate
    gf, g_pids, g_camids = _feature_extraction(gallery_loader)
  File "/home/user01/python/ljm/install/deep-person-reid-master/torchreid/engine/engine.py", line 363, in _feature_extraction
    for batch_idx, data in enumerate(data_loader):
  File "/home/user01/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/user01/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1068, in _next_data
    idx, data = self._get_data()
  File "/home/user01/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1024, in _get_data
    success, data = self._try_get_data()
  File "/home/user01/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 885, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 17465) exited unexpectedly
terminate called without an active exception
Aborted

from __future__ import absolute_import, print_function, division

import sys
import os
import os.path as osp
import glob

import torchreid
from torchreid.data import ImageDataset

import torch
import torch.nn as nn


class KoreanReidImage(ImageDataset):
    dataset_dir = 'koreanreidimage'

    def __init__(self, root='', **kwargs):
        self.root = osp.abspath(osp.expanduser(root))
        self.dataset_dir = osp.join(self.root, self.dataset_dir)

        self.train_dir = osp.join(self.dataset_dir, 'Train')
        self.query_dir = osp.join(self.dataset_dir, 'Validation')
        self.gallery_dir = osp.join(self.dataset_dir, 'All')

        required_files = [
            self.dataset_dir, self.train_dir, self.query_dir, self.gallery_dir
        ]
        self.check_before_run(required_files)

        self.query_pids = set()
        train = self.process_dir(self.train_dir, mode='train')
        query = self.process_dir(self.query_dir, mode='query')
        gallery = self.process_dir(self.gallery_dir, mode='gallery')

        super(KoreanReidImage, self).__init__(train, query, gallery, **kwargs)

    def process_dir(self, dir_path, mode='train'):
        img_paths = glob.glob(dir_path + '/**/*.png', recursive=True)

        # The person id is parsed from the second '_'-separated token of the
        # filename (dropping its leading letter); the camera id comes from the
        # fourth token, e.g. "IN_HPID_SN4_CAMID_11503.png".
        pid_container = set()
        for img_path in img_paths:
            img_name = img_path.split('/')[-1]
            pid = int(img_name.split('_')[1][1:])
            if mode == 'train':
                pid_container.add(pid)
            elif mode == 'query':
                self.query_pids.add(pid)
        if mode == 'train':
            pid2label = {pid: label for label, pid in enumerate(pid_container)}

        data = []
        for img_path in img_paths:
            img_name = img_path.split('/')[-1]
            pid = int(img_name.split('_')[1][1:])
            if mode == 'gallery':
                # keep only gallery images whose identity also appears in the query set
                if pid not in self.query_pids:
                    continue
            camid = int(img_name.split('_')[3])

            if mode == 'train':
                pid = pid2label[pid]
            data.append((img_path, pid, camid))

        return data

torchreid.data.register_image_dataset('koreanreidimage', KoreanReidImage)

datamanager = torchreid.data.ImageDataManager(
    root='/home/user01/_data1/reid-data',
    sources='koreanreidimage',
    targets='koreanreidimage',
    height=256,
    width=128,
    batch_size_train=256,
    batch_size_test=256,
    transforms=['random_flip', 'random_erase']
)

model = torchreid.models.build_model(
    name='osnet_x1_0',
    num_classes=datamanager.num_train_pids,
    loss='softmax',
    pretrained=True,
    use_gpu=True
)

model = nn.DataParallel(model).cuda()

optimizer = torchreid.optim.build_optimizer(
    model,
    optim='amsgrad',
    lr=0.0015
)

scheduler = torchreid.optim.build_lr_scheduler(
    optimizer,
    lr_scheduler='single_step',
    stepsize=60,
    gamma=0.1
)

engine = torchreid.engine.ImageSoftmaxEngine(
    datamanager,
    model,
    optimizer=optimizer,
    scheduler=scheduler,
    use_gpu=True,
    label_smooth=True
)

engine.run(
    save_dir='log/koreanreidimage/osnet/finetune/',
    max_epoch=150,
    eval_freq=10,
    print_freq=10,
    test_only=False,
    fixbase_epoch=10,
    open_layers=['classifier']
)

Building train transforms ...

  • resize to 256x128

  • random flip

  • to torch tensor of range [0, 1]

  • normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

  • random erase

Building test transforms ...

  • resize to 256x128

  • to torch tensor of range [0, 1]

  • normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

=> Loading train (source) dataset
=> Loaded KoreanReidImage

    subset | # ids | # images | # cameras

    train | 502 | 149122 | 206
    query | 500 | 119361 | 190
    gallery | 500 | 2362447 | 198

=> Loading test (target) dataset
=> Loaded KoreanReidImage

subset | # ids | # images | # cameras

train | 502 | 149122 | 206
query | 500 | 119361 | 190
gallery | 500 | 2362447 | 198

**************** Summary ****************
source : ['koreanreidimage']
source datasets : 1
source ids : 502
source images : 149122
source cameras : 206
target : ['koreanreidimage']


jungmin-lim · Aug 25, 2021

It looks like a PyTorch issue.

siyanhu · Sep 17, 2021

The size of the dataset caused this issue. Loading the gallery dataset with 4 workers consumed all 128 GB of memory and the system crashed. Changing the number of workers to 0 made loading really slow, but it solved this issue.
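For anyone hitting the same thing, a minimal sketch of that workaround, assuming the `workers` argument of `torchreid.data.ImageDataManager` in this torchreid version (it controls the number of PyTorch DataLoader worker processes):

    # Sketch: same data manager as in the original script, but with workers=0 so
    # the gallery loader runs in the main process instead of forking workers
    # (slower, but avoids the per-worker memory blow-up described above).
    datamanager = torchreid.data.ImageDataManager(
        root='/home/user01/_data1/reid-data',
        sources='koreanreidimage',
        targets='koreanreidimage',
        height=256,
        width=128,
        batch_size_train=256,
        batch_size_test=256,
        workers=0,  # 0 = load data in the main process
        transforms=['random_flip', 'random_erase']
    )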

But keeping the extracted feature matrix (512 x 2362447) in memory caused another memory issue. Is there any way I can solve this problem?
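For scale, a rough back-of-the-envelope estimate (assuming float32 features and that the full query-gallery distance matrix is built in memory during evaluation) suggests the distance matrix, not the feature matrix, is the dominant cost:

    # Rough memory estimate for the evaluation step (float32, 4 bytes per value).
    n_query, n_gallery, feat_dim = 119_361, 2_362_447, 512

    gallery_feat_gb = n_gallery * feat_dim * 4 / 1e9   # ~4.8 GB for gallery features
    dist_matrix_gb = n_query * n_gallery * 4 / 1e9     # ~1128 GB for the distance matrix

    print(f"gallery features : {gallery_feat_gb:.1f} GB")
    print(f"distance matrix  : {dist_matrix_gb:.1f} GB")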

jungmin-lim · Sep 17, 2021

The easy answer is always to use a better GPU 🤣. However, you can try splitting the data into smaller subsets and testing them one by one.
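One hypothetical way to do that split, reusing the KoreanReidImage class from above; the GalleryChunk subclass and its chunk_id/num_chunks attributes are purely illustrative and not part of torchreid:

    # Illustrative sketch only: evaluate against one slice of the gallery at a time.
    # Results from a reduced gallery are not directly comparable to full-gallery metrics.
    class GalleryChunk(KoreanReidImage):
        chunk_id = 0       # which gallery slice this dataset keeps
        num_chunks = 10    # total number of slices

        def process_dir(self, dir_path, mode='train'):
            data = super().process_dir(dir_path, mode=mode)
            if mode == 'gallery':
                # keep every num_chunks-th gallery image, starting at chunk_id
                data = data[self.chunk_id::self.num_chunks]
            return data

Each chunk would then be registered and evaluated separately (for example with test_only=True in engine.run), and the per-chunk results compared or merged by hand.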

thuangb · Oct 8, 2021