deep-person-reid
Error while extracting features from gallery set
I encountered this error while using a custom dataset. The filenames of the images in the dataset look like this: "path/to/dataset/IN_HPID_SN4_CAMID_11503.png". The code I used is below; it worked fine with the built-in datasets, and the datamanager loaded the custom dataset successfully when it was declared. Is there something I might have missed?
epoch: [10/150][570/582] time 0.159 (0.197) data 0.000 (0.006) eta 4:27:21 loss 1.7876 (1.7792) acc 87.8906 (87.9955) lr 0.001500
epoch: [10/150][580/582] time 0.167 (0.197) data 0.000 (0.006) eta 4:27:27 loss 1.8199 (1.7797) acc 87.8906 (87.9815) lr 0.001500
Evaluating koreanreidimage (source)
Extracting features from query set ...
Done, obtained 119361-by-512 matrix
Extracting features from gallery set ...
Traceback (most recent call last):
  File "/home/user01/.local/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 872, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/usr/local/lib/python3.7/queue.py", line 179, in get
    self.not_empty.wait(remaining)
  File "/usr/local/lib/python3.7/threading.py", line 300, in wait
    gotit = waiter.acquire(True, timeout)
  File "/home/user01/.local/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 17465) is killed by signal: Killed.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "./koreanreidimage.py", line 118, in
from __future__ import absolute_import, print_function, division
import sys
import os
import os.path as osp
import glob

import torchreid
from torchreid.data import ImageDataset

import torch
import torch.nn as nn
class KoreanReidImage(ImageDataset):
    dataset_dir = 'koreanreidimage'

    def __init__(self, root='', **kwargs):
        self.root = osp.abspath(osp.expanduser(root))
        self.dataset_dir = osp.join(self.root, self.dataset_dir)
        self.train_dir = osp.join(self.dataset_dir, 'Train')
        self.query_dir = osp.join(self.dataset_dir, 'Validation')
        self.gallery_dir = osp.join(self.dataset_dir, 'All')

        required_files = [
            self.dataset_dir, self.train_dir, self.query_dir, self.gallery_dir
        ]
        self.check_before_run(required_files)

        # collect query pids first so the gallery can be filtered to them
        self.query_pids = set()
        train = self.process_dir(self.train_dir, mode='train')
        query = self.process_dir(self.query_dir, mode='query')
        gallery = self.process_dir(self.gallery_dir, mode='gallery')

        super(KoreanReidImage, self).__init__(train, query, gallery, **kwargs)

    def process_dir(self, dir_path, mode='train'):
        img_paths = glob.glob(dir_path + '/**/*.png', recursive=True)

        # first pass: collect pids; pid is the second underscore-separated
        # filename field with its leading character stripped
        pid_container = set()
        for img_path in img_paths:
            img_name = img_path.split('/')[-1]
            pid = int(img_name.split('_')[1][1:])
            if mode == 'train':
                pid_container.add(pid)
            elif mode == 'query':
                self.query_pids.add(pid)

        if mode == 'train':
            # relabel training pids to contiguous 0-based labels
            pid2label = {pid: label for label, pid in enumerate(pid_container)}

        # second pass: build the (img_path, pid, camid) tuples
        data = []
        for img_path in img_paths:
            img_name = img_path.split('/')[-1]
            pid = int(img_name.split('_')[1][1:])
            if mode == 'gallery':
                # keep only gallery images whose pid also appears in the query set
                if pid not in self.query_pids:
                    continue
            camid = int(img_name.split('_')[3])
            if mode == 'train':
                pid = pid2label[pid]
            data.append((img_path, pid, camid))
        return data
torchreid.data.register_image_dataset('koreanreidimage', KoreanReidImage)
datamanager = torchreid.data.ImageDataManager(
    root='/home/user01/_data1/reid-data',
    sources='koreanreidimage',
    targets='koreanreidimage',
    height=256,
    width=128,
    batch_size_train=256,
    batch_size_test=256,
    transforms=['random_flip', 'random_erase']
)

model = torchreid.models.build_model(
    name='osnet_x1_0',
    num_classes=datamanager.num_train_pids,
    loss='softmax',
    pretrained=True,
    use_gpu=True
)
model = nn.DataParallel(model).cuda()

optimizer = torchreid.optim.build_optimizer(
    model,
    optim='amsgrad',
    lr=0.0015
)

scheduler = torchreid.optim.build_lr_scheduler(
    optimizer,
    lr_scheduler='single_step',
    stepsize=60,
    gamma=0.1
)

engine = torchreid.engine.ImageSoftmaxEngine(
    datamanager,
    model,
    optimizer=optimizer,
    scheduler=scheduler,
    use_gpu=True,
    label_smooth=True
)

engine.run(
    save_dir='log/koreanreidimage/osnet/finetune/',
    max_epoch=150,
    eval_freq=10,
    print_freq=10,
    test_only=False,
    fixbase_epoch=10,
    open_layers=['classifier']
)
Building train transforms ...
+ resize to 256x128
+ random flip
+ to torch tensor of range [0, 1]
+ normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
+ random erase
Building test transforms ...
+ resize to 256x128
+ to torch tensor of range [0, 1]
+ normalization (mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
=> Loading train (source) dataset
=> Loaded KoreanReidImage
  ----------------------------------------
  subset   | # ids | # images | # cameras
  ----------------------------------------
  train    |   502 |   149122 |       206
  query    |   500 |   119361 |       190
  gallery  |   500 |  2362447 |       198
  ----------------------------------------
=> Loading test (target) dataset
=> Loaded KoreanReidImage
  ----------------------------------------
  subset   | # ids | # images | # cameras
  ----------------------------------------
  train    |   502 |   149122 |       206
  query    |   500 |   119361 |       190
  gallery  |   500 |  2362447 |       198
  ----------------------------------------
**************** Summary ****************
source : ['koreanreidimage']
source datasets : 1
source ids : 502
source images : 149122
source cameras : 206
target : ['koreanreidimage']
It looks like a PyTorch issue.
The size of the dataset caused this issue. Loading the gallery set with 4 DataLoader workers consumed all 128GB of memory and the system crashed. Setting the number of workers to 0 made loading really slow, but it worked around the issue.
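For reference, a minimal sketch of that workaround; `workers` is the `ImageDataManager` argument that sets the DataLoader worker count (it defaults to 4):

# Single-process data loading: much slower, but the huge gallery list is not
# replicated across worker processes, which is the likely cause of the OOM kill.
datamanager = torchreid.data.ImageDataManager(
    root='/home/user01/_data1/reid-data',
    sources='koreanreidimage',
    targets='koreanreidimage',
    height=256,
    width=128,
    batch_size_train=256,
    batch_size_test=256,
    workers=0,  # no worker subprocesses
    transforms=['random_flip', 'random_erase']
)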
But holding the extracted feature matrix (512 x 2362447) in memory caused another memory issue. Is there any way I can solve this problem?
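For a rough sense of scale (back-of-the-envelope arithmetic, assuming float32 features): the gallery feature matrix itself is a few GB, but the query-gallery distance matrix that evaluation computes afterwards is far larger:

# Back-of-the-envelope sizes, assuming float32 (4 bytes per value):
feat_gb = 2362447 * 512 * 4 / 1e9      # gallery feature matrix: ~4.8 GB
dist_tb = 119361 * 2362447 * 4 / 1e12  # query-gallery distance matrix: ~1.1 TB
print(feat_gb, dist_tb)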
The easy answer is always to use a better GPU 🤣. However, you can try splitting the data into smaller subsets and testing them one by one.
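A sketch of that idea using torchreid's FeatureExtractor utility (the checkpoint path, chunk size, and output filenames below are placeholders, not from this thread): extract gallery features one chunk at a time and write each chunk to disk, so at most one chunk lives in RAM:

import glob
import numpy as np
from torchreid.utils import FeatureExtractor

# hypothetical checkpoint produced by the training run above
extractor = FeatureExtractor(
    model_name='osnet_x1_0',
    model_path='log/koreanreidimage/osnet/finetune/model.pth.tar',
    device='cuda'
)

gallery_paths = sorted(glob.glob(
    '/home/user01/_data1/reid-data/koreanreidimage/All/**/*.png', recursive=True
))

chunk_size = 50000  # tune to available memory
for i in range(0, len(gallery_paths), chunk_size):
    chunk = gallery_paths[i:i + chunk_size]
    feats = extractor(chunk)  # tensor of shape (len(chunk), 512)
    # persist this chunk so it can be freed before the next one is extracted
    np.save('gallery_feats_%07d.npy' % i, feats.cpu().numpy())

The query-gallery distances can then also be computed chunk by chunk, keeping only the running best matches per query instead of the full distance matrix.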