AssertionError in training

Open SuperTyrael opened this issue 5 years ago • 8 comments

I hit this AssertionError while training the model. Can you help me?

Traceback (most recent call last):
  File "/anaconda3/envs/fasterRCNN/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/CrowdDet/tools/train.py", line 109, in train_worker
    do_train_epoch(net, data_iter, optimizer, rank, epoch_id, train_config)
  File "/CrowdDet/tools/train.py", line 58, in do_train_epoch
    assert torch.isfinite(total_loss).all(), outputs
AssertionError: {'loss_rpn_cls': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rpn_loc': tensor(inf, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rcnn_loc': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_rcnn_cls': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>)}

SuperTyrael avatar Jul 22 '20 05:07 SuperTyrael

Try running it several times; this error sometimes occurs at the beginning of training.

xg-chu avatar Jul 23 '20 05:07 xg-chu

Sorry, but we've tried several times and get the same error at almost the same iteration in the first epoch.

SuperTyrael avatar Jul 31 '20 11:07 SuperTyrael

Have you modified the code or data? This kind of error rarely occurs. Try changing the dataset initialization sequence or decreasing the learning rate.
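
Before lowering the learning rate, it can help to log which loss term goes non-finite first. The assertion message in the report above prints a dict of loss terms, so a small check over that dict is enough. This is only a sketch: the helper name `nonfinite_losses` is hypothetical and not part of CrowdDet, and it assumes every value converts with `float()`, which holds for Python numbers and single-element tensors.

```python
import math


def nonfinite_losses(outputs):
    """Return the sorted names of loss terms whose value is NaN or infinite.

    `outputs` is assumed to be a dict of name -> scalar, like the dict
    printed by the failing assertion (hypothetical helper, not CrowdDet code).
    """
    bad = []
    for name, value in outputs.items():
        # float() accepts plain numbers and single-element tensors alike.
        if not math.isfinite(float(value)):
            bad.append(name)
    return sorted(bad)


# Illustrative values only, mirroring the shape of the reported dict.
example = {
    "loss_rpn_cls": float("nan"),
    "loss_rpn_loc": float("inf"),
    "loss_rcnn_loc": 0.42,
}
print(nonfinite_losses(example))
```

Calling this just before the `assert` in the training loop shows whether one term (for example the RPN localization loss) diverges before the others, which points at the data or targets feeding that branch rather than at the learning rate alone.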

xg-chu avatar Jul 31 '20 12:07 xg-chu

I have another AssertionError in training. Can you help me?

Num of GPUs:3, learning rate:0.00750, mini batch size:2, train_epoch:30, iter_per_epoch:2500, decay_epoch:[24, 27]
Init multi-processing training...
Traceback (most recent call last):
  File "/media/xcj/data/xcj/CrowdDet/tools/train.py", line 174, in <module>
    run_train()
  File "/media/xcj/data/xcj/CrowdDet/tools/train.py", line 171, in run_train
    multi_train(args, config, Network)
  File "/media/xcj/data/xcj/CrowdDet/tools/train.py", line 155, in multi_train
    torch.multiprocessing.spawn(train_worker, nprocs=num_gpus, args=(train_config, network, config))
  File "/home/xcj/anaconda3/envs/py_cpn/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/xcj/anaconda3/envs/py_cpn/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/xcj/anaconda3/envs/py_cpn/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/media/xcj/data/xcj/CrowdDet/tools/train.py", line 102, in train_worker
    crowdhuman = CrowdHuman(config, if_train=True)
  File "../lib/data/CrowdHuman.py", line 20, in __init__
    self.records = misc_utils.load_json_lines(source)
  File "../lib/utils/misc_utils.py", line 11, in load_json_lines
    assert os.path.exists(fpath)
AssertionError

chjXu avatar Oct 21 '20 08:10 chjXu

It looks like the annotation file path is wrong. Check "train_source" and "eval_source" in config.py.
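
A quick way to run that check, as a sketch: resolve each configured path and report the ones that do not exist, which is the condition the failing `assert os.path.exists(fpath)` tests. The function name `missing_sources` and the example paths are placeholders, not values from the repository's config.py.

```python
import os


def missing_sources(paths):
    """Return name -> absolute path for every entry that does not exist.

    `paths` is assumed to be a dict such as
    {"train_source": ..., "eval_source": ...} (hypothetical helper).
    """
    return {
        name: os.path.abspath(path)
        for name, path in paths.items()
        if not os.path.exists(path)
    }


# Placeholder paths: substitute the values set in your own config.py.
configured = {
    "train_source": "/path/to/train_annotations",
    "eval_source": "/path/to/eval_annotations",
}
for name, resolved in missing_sources(configured).items():
    print(f"{name} not found: {resolved}")
```

Printing the absolute path is deliberate: a relative path is resolved against the current working directory, so the same config value can point to different locations depending on where the script is launched from.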

xg-chu avatar Oct 28 '20 07:10 xg-chu

Hello, has your problem been solved? I seem to have run into a similar issue with load_json_lines: the path to json_file cannot be found.

yaru-w avatar Nov 11 '20 08:11 yaru-w

Just check whether the file actually exists at that path.

xg-chu avatar Nov 11 '20 15:11 xg-chu

Yes. I hit the following error when running this command: python3 eval_json.py -f your_json_path.json

Traceback (most recent call last):
  File "eval_json.py", line 36, in <module>
    run_eval()
  File "eval_json.py", line 33, in run_eval
    eval_all(args)
  File "eval_json.py", line 19, in eval_all
    res_line, JI = compute_JI.evaluation_all(args.json_file, 'box')
  File "../lib/evaluate/compute_JI.py", line 18, in evaluation_all
    records = misc_utils.load_json_lines(path)
  File "../lib/utils/misc_utils.py", line 11, in load_json_lines
    assert os.path.exists(fpath)
AssertionError

Thank you. I don't know what your_json_path.json is, so I can't find where it is located. When I open result_eval.md there is a line containing your_json_path.json, but I still don't know how to fix this error. Do you have any suggestions?

yaru-w avatar Nov 12 '20 02:11 yaru-w