AlignedReID-Re-Production-Pytorch
train_ml.py -d '((0,),(0,))' Error
When I run train_ml.py on a single GPU (-d ((0,),(0,))), I get the following error. What exactly is causing it? Thanks!
Ep 1, 37.78s, gp 23.85%, gm 17.38%, gd_ap 13.3094, gd_an 12.2454, gL 1.4902, gdmL 1.1416, loss 2.6318
Exception in thread Thread-6:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "script/experiment/train_ml.py", line 475, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 163, in hard_example_mining
dist_mat[is_neg].contiguous().view(N, -1), 1, keepdim=True)
RuntimeError: invalid argument 2: size '[32 x -1]' is invalid for input with 1095010584 elements at /pytorch/torch/lib/TH/THStorage.c:37
Hi, I cannot tell what is causing this either. I guess your batch has 8 identities with 4 images each; the number 1095010584 in the error message is very strange.
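To make the number concrete, here is a minimal sketch (not the repo's loss.py; it assumes the usual P identities x K images sampling and a reasonably recent PyTorch) of the shape that hard_example_mining relies on. For an 8 x 4 = 32 batch the negative mask should select exactly 32 * 28 = 896 elements, so view(N, -1) can only succeed when the element count is a multiple of N; a count like 1095010584 means the distance matrix or mask that reached view() was not the expected N x N one.

```python
import torch

# Minimal sketch of the shape assumption behind hard_example_mining
# (not the repo's loss.py; assumes P identities x K images per batch,
# here P=8, K=4 to match the first post).
P, K = 8, 4
N = P * K
labels = torch.tensor([i for i in range(P) for _ in range(K)])  # [N]
dist_mat = torch.rand(N, N)                                     # pairwise distances

is_pos = labels.expand(N, N).eq(labels.expand(N, N).t())  # K True per row
is_neg = labels.expand(N, N).ne(labels.expand(N, N).t())  # N-K True per row

# view(N, -1) only works because the masked selections hold exactly
# N*K and N*(N-K) elements:
dist_ap = dist_mat[is_pos].contiguous().view(N, -1)  # [32, 4]  -> 128 elements
dist_an = dist_mat[is_neg].contiguous().view(N, -1)  # [32, 28] -> 896 elements
print(dist_ap.shape, dist_an.shape)
```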
I ran into this problem too. Training two models on the same GPU with train_ml.py triggers this error on both Market1501 and DukeMTMC-reID (I have not tested CUHK03 yet). It may be thread-related: sometimes the error appears only after the first epoch finishes, sometimes before the first epoch is done.
duke test set
NO. Images: 19889
NO. IDs: 1110
NO. Query Images: 2228
NO. Gallery Images: 17661
NO. Multi-query Images: 0
Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 163, in hard_example_mining
dist_mat[is_neg].contiguous().view(N, -1), 1, keepdim=True)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 992 elements at /pytorch/torch/lib/TH/THStorage.c:37
Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 175, in hard_example_mining
ind[is_pos].contiguous().view(N, -1), 1, relative_p_inds.data)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 992 elements at /pytorch/torch/lib/TH/THStorage.c:37
market1501 trainval set
NO. Images: 12936
NO. IDs: 751
loading pickle file: /home/sobey123/code/project/AlignedReId/datasets/market1501/partitions.pkl
market1501 test set
NO. Images: 31969
NO. IDs: 751
NO. Query Images: 3368
NO. Gallery Images: 15913
NO. Multi-query Images: 12688
Exception in thread Thread-5:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 177, in hard_example_mining
ind[is_neg].contiguous().view(N, -1), 1, relative_n_inds.data)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 30752 elements at /pytorch/torch/lib/TH/THStorage.c:37
Exception in thread Thread-6:
Traceback (most recent call last):
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
self.run()
File "/home/sobey123/anaconda3/lib/python3.6/threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "script/experiment/train_ml.py", line 481, in thread_target
normalize_feature=cfg.normalize_feature)
File "./aligned_reid/model/loss.py", line 215, in global_loss
dist_mat, labels, return_inds=True)
File "./aligned_reid/model/loss.py", line 159, in hard_example_mining
dist_mat[is_pos].contiguous().view(N, -1), 1, keepdim=True)
RuntimeError: invalid argument 2: size '[128 x -1]' is invalid for input with 30752 elements at /pytorch/torch/lib/TH/THStorage.c:37
@zouliangyu @ShiinaMitsuki Thanks for pointing out the problem! Have you tried running the two models on different GPUs, and does the error show up there? I will debug this when I get the time.
@huanghoujing Not yet; I only have a single GPU available at the moment, which is how I ran into this problem.
It seems to be a Python version issue; after switching back to 2.7 the problem no longer appeared, (lll¬ω¬)
@ShiinaMitsuki But the error message in the first post of this issue shows Python 2.7 was being used.
Mutual learning needs to run on two GPUs. It cannot run on one: memory is not released in time, and GPU memory gradually fills up until it overflows.
@Gavin666Github If you want to run on a single GPU without lowering the batch size, the two models have to be updated alternately: each batch updates one model (once the other model's forward pass has produced its output, its intermediate variables are deleted), and the multithreading machinery is no longer needed either. The code has to be modified; see the sketch below.
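For reference, a rough sketch of that alternating scheme (a hypothetical rewrite, not the actual train_ml.py; the model, optimizer, and loss names are placeholders):

```python
import torch

def train_alternating(model_a, model_b, opt_a, opt_b, loader,
                      id_loss, mutual_loss, mutual_weight=0.01):
    """Single-GPU mutual learning without threads: each batch updates only
    one model, while the other runs a no-grad forward pass so that its
    intermediate activations are freed immediately."""
    pairs = [(model_a, opt_a, model_b), (model_b, opt_b, model_a)]
    for step, (imgs, labels) in enumerate(loader):
        imgs, labels = imgs.cuda(), labels.cuda()
        learner, opt, peer = pairs[step % 2]   # alternate which model is updated
        with torch.no_grad():                  # peer builds no autograd graph
            peer_out = peer(imgs)
        out = learner(imgs)
        loss = id_loss(out, labels) + mutual_weight * mutual_loss(out, peer_out)
        opt.zero_grad()
        loss.backward()
        opt.step()
```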
@Gavin666Github But I have already trained with mutual learning on a single GPU many times and never hit a problem afterwards; GPU memory usage stays at roughly 95%, and it never runs out of memory.
@huanghoujing @ShiinaMitsuki Thanks for the guidance! The failure does seem to come down to insufficient memory (out of memory). My local machine (single GPU) has only 4 GB; running the Global Loss alone is fine (with a fairly small batch size), but the same batch size with mutual learning fails. Switching to a server with multiple GPUs, each with 8 GB of memory, works perfectly. Problem solved.
@ShiinaMitsuki Can your mutual learning run with batch size = 32x4 fit on a single 12 GB card?
@ShiinaMitsuki When you train mutual learning on a single GPU, do you simply set -d ((0,),(0,))?
@huanghoujing It works if I make the batch size a bit smaller.
@Coler1994 -d ((0,),(0,)) --num_models 2
@ShiinaMitsuki @huanghoujing I just tried mutual learning on a single card; it ran all the way through without the bug. Awesome.
Could you post a screenshot of the results from running on two GPUs?
Hello, I ran into the same problem on Python 2.7.15. Did you manage to solve it? @ShiinaMitsuki said switching to Python 2.7 fixed it @Coler1994, but we hit the problem on 2.7 as well... Could you also share the exact command you used? Thanks.
@Coler1994 Which Python version did you run with? Could you share how you modified the code?