deep-high-resolution-net.pytorch
Error when training on MPII dataset
I managed to run inference on several images successfully.
However, when I try to train on MPII data using the example command line:
python tools/train.py \
--cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml \
I get the following error:
Traceback (most recent call last):
File "/hdd/deep-high-resolution-net.pytorch/tools/train.py", line 223, in <module>
main()
File "/hdd/deep-high-resolution-net.pytorch/tools/train.py", line 111, in main
writer_dict['writer'].add_graph(model, (dump_input, ))
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/tensorboardX/writer.py", line 738, in add_graph
self._get_file_writer().add_graph(graph(model, input_to_model, verbose, **kwargs))
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 240, in graph
trace = torch.jit.trace(model, args)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 772, in trace
check_tolerance, _force_outplace, _module_class)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 898, in trace_module
module = make_module(mod, _module_class, _compilation_unit)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 669, in make_module
return _module_class(mod, _compilation_unit=_compilation_unit)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
original_init(self, *args, **kwargs)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
original_init(self, *args, **kwargs)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
self._modules[name] = TracedModule(submodule, id_set)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
original_init(self, *args, **kwargs)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
self._modules[name] = TracedModule(submodule, id_set)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
original_init(self, *args, **kwargs)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
self._modules[name] = TracedModule(submodule, id_set)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
original_init(self, *args, **kwargs)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
self._modules[name] = TracedModule(submodule, id_set)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
original_init(self, *args, **kwargs)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
self._modules[name] = TracedModule(submodule, id_set)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
original_init(self, *args, **kwargs)
File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1855, in __init__
assert(isinstance(orig, torch.nn.Module))
AssertionError
I have the same problem as you. Have you solved it?
Same problem here.
same problem
Reference: #98. Downgrading the tensorboardX version to 1.6 fixed it, and everything works well.
Which PyTorch version do you use? I downgraded tensorboardX to 1.6, and another issue occurred, as below: @TeeboneTing @leoxiaobin
=> init weights from normal distribution
=> loading pretrained model models/pytorch/imagenet/hrnet_w32-36af842e.pth
Traceback (most recent call last):
File "tools/train.py", line 223, in
I use pytorch==1.2 and torchvision==0.4. Does anyone have a similar experience? There is no problem when running "python tools/test.py" on coco/mpii.
It works with torch==1.0 and tensorboardX==1.6.
@dzyjjpy My version: torch==1.0.0 torchvision==0.2.1 tensorboardX==1.6
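For anyone wanting to reproduce this known-good combination, it can be captured as a requirements pin (a sketch of the versions reported above; the torch 1.0.0 wheel may additionally need a platform- or CUDA-specific install command):

```
torch==1.0.0
torchvision==0.2.1
tensorboardX==1.6
```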
@dzyjjpy hello, can i ask you one issue?
Thanks for your issue. My versions: pytorch==1.2.0, torchvision==0.4.0, tensorboardX==1.6. It works, but other bugs come up.
@TeeboneTing thanks. @eng100200 please describe your issue.
@GZP123 have you run the test code on mpii for w32?
I have not. I hit a similar bug to @dzyjjpy's, and I cannot run it yet; let me try some more.
@GZP123 have you run the test code on mpii for w32?
Now my versions are: pytorch==1.0.1, torchvision==0.2.1, tensorboardX==1.6, and I can run the training code. For the test code, I will wait a few days.
@dzyjjpy hello, I want to ask a few questions:
- Why do we flip images when testing on the MPII dataset?
- I think MPII provides bounding boxes, but I am not sure. How did you get the bounding boxes for MPII, and how did you compute the center and scale?
- For the MPII validation set used for testing, do you run single-person or multi-person detection?
- How can I display images after detecting the pose points, i.e. overlay the pose points on the image?
- How can I test an image of my own?
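On the first question: flip testing is a standard evaluation trick, not specific to MPII. The model is also run on the horizontally mirrored image, the resulting heatmaps are mirrored back with left/right joint channels swapped, and the two predictions are averaged, which usually improves accuracy slightly. A minimal sketch (the function names and the `predict` callback are illustrative, not the repo's code; the flip pairs follow the usual 16-joint MPII ordering, which you should verify against your annotation files):

```python
import numpy as np

# Left/right joint pairs in the conventional 16-joint MPII ordering
# (ankles, knees, hips, wrists, elbows, shoulders) -- verify for your data.
MPII_FLIP_PAIRS = [(0, 5), (1, 4), (2, 3), (10, 15), (11, 14), (12, 13)]

def flip_back(flipped_heatmaps, flip_pairs):
    """Undo a horizontal flip: mirror the width axis, swap L/R joint channels."""
    out = flipped_heatmaps[:, :, :, ::-1].copy()  # shape (N, joints, H, W)
    for left, right in flip_pairs:
        out[:, [left, right]] = out[:, [right, left]]
    return out

def flip_test(predict, image, flip_pairs):
    """Average the prediction on the image and on its mirrored copy."""
    heatmaps = predict(image)                          # (N, joints, H, W)
    flipped = predict(image[:, :, :, ::-1].copy())     # image is (N, C, H, W)
    return (heatmaps + flip_back(flipped, flip_pairs)) / 2.0
```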
@GZP123 did you train using mpii?
I did, and the speed is sufficient. But I cannot train on MS COCO because of my GPUs: I have Ubuntu 16.04 and 2 GTX 1080 GPUs. Maybe I need to reduce the batch size if I want to train on MS COCO.
@ZP-Guo thanks for the reply. I am also planning to train on MPII, but I would do it with a compact model (less than 8 GFLOPs), and my target is to train for sixteen points only. Can I ask you one question: why is the scale multiplied by 1.25 when the center has a positive value? if center[0] != -1: scale = scale * 1.25
So sorry, I cannot give you an answer, because I do not pay much attention to code details. I think the author often sets values empirically, so I cannot explain it. For example, HRNet is also trained on face datasets; as you know, HRNet can handle many tasks, such as human pose estimation and face detection. In an issue from the face detection task, someone asked how to get the "center" and "scale" values, and it turned out the author set the value "200" empirically. My friend found that issue and told me about it while we chatted, so I did not save the link. What I want to say is that the author may set some values empirically, without a calculated reason. Many problems we cannot solve may come up, but do not give up; we will figure it out. Come on, bro.
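For what it is worth, a common reading of that line (an assumption on my part, not confirmed by the authors here): MPII's scale is roughly the person height in units of 200 px, and the 1.25 factor pads the tight person box by 25% so the crop keeps some context around the limbs, while center[0] == -1 marks an invalid annotation. A sketch of the usual box-to-center/scale conversion (the names, the 200 px constant, and the padding factor follow the convention discussed above, but this is illustrative, not the repo's exact code):

```python
import numpy as np

PIXEL_STD = 200.0  # scale is conventionally expressed in units of 200 px

def box_to_center_scale(x, y, w, h, aspect_ratio=0.75, padding=1.25):
    """Convert a person box to the (center, scale) pair a top-down loader expects."""
    center = np.array([x + w * 0.5, y + h * 0.5])
    # Grow the shorter side so the crop matches the network input aspect
    # ratio (e.g. 192x256 -> 0.75) without distorting the person.
    if w > aspect_ratio * h:
        h = w / aspect_ratio
    else:
        w = h * aspect_ratio
    # Pad by 25% so the crop includes context around the tight box.
    scale = np.array([w / PIXEL_STD, h / PIXEL_STD]) * padding
    return center, scale
```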
@ZP-Guo I understand your answer. I will try to dig deeper, and if I find an answer I will share it with you. For what it is worth, I am using HRNet for pose estimation.
Maybe I do not understand CNNs well enough. For example, why does VGG-16 have 64, 128, 256 channels? I think the authors set them empirically, and I cannot explain it through calculation. I also use CNNs for human pose estimation, like you. And I will thank you for sharing if you share what you find.
@ZP-Guo shake hands. I think the number-of-channels question is just an empirical choice. It could have many reasons: memory cost, computation, or those channel counts carrying enough information. Do you have WeChat?
I mean, why did they choose 64, 128, 256, and so on? 128 = 64 * 2 and 256 = 128 * 2, but why not 61, 122, 244 or 55, 110, 220? It does not matter, though; maybe it is enough that I know how to use it. We can exchange experience via e-mail; you can find mine on my homepage.
@ZP-Guo @dzyjjpy @alex-razor have you received the error "RuntimeError: Failed to export an ONNX attribute, since it's not constant, please try to make things (e.g., kernel size) static if possible" while training on the MPII dataset? I have pytorch==1.1.0 and tensorboardX==1.6 installed. Should I upgrade tensorboardX?
So sorry, I have not met this bug. Maybe you can update your environment according to my version.
Sorry, could you tell me the complete set of versions for this work? I want to train on the COCO dataset, but I always get an error at cpu_nms.py, line 12: cimport numpy as np
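That "cimport numpy as np" line means cpu_nms is Cython source, which plain Python cannot import until the extension is compiled; this repo ships a Makefile under lib/ for that step (cd lib && make, which runs build_ext, though verify against the README). A tiny sketch showing why the uncompiled file fails at exactly that line:

```python
def cython_line_fails(src):
    """Return the line number where Python's parser rejects Cython-only syntax."""
    try:
        # Compiling the source as ordinary Python fails on `cimport`,
        # which only the Cython compiler understands.
        compile(src, "cpu_nms.py", "exec")
    except SyntaxError as e:
        return e.lineno
    return None

print(cython_line_fails("cimport numpy as np\n"))  # the parser stops at line 1
```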
Hello, did you find answers for your questions? I'm also curious.
@dzyjjpy you mean flip for testing? part 2 to 5 i can answer your questions
Can you please answer parts 2 to 5?
None entries in nn.ModuleList break the JIT in higher versions of PyTorch. This problem is addressed upstream in https://github.com/pytorch/pytorch/issues/30459.
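Until the versions are aligned, one pragmatic workaround (a sketch under my own assumptions, not the repo's code) is to make the add_graph call in tools/train.py best-effort, since training itself does not depend on the TensorBoard graph dump:

```python
def add_graph_safe(writer, model, dump_input):
    """Best-effort graph dump; a JIT tracing failure is logged, not fatal."""
    try:
        writer.add_graph(model, (dump_input,))
        return True
    except Exception as exc:
        print('skipping add_graph:', exc)
        return False

class FailingWriter:
    """Stand-in for a tensorboardX writer whose tracing raises, as in the
    traceback at the top of this thread."""
    def add_graph(self, model, args):
        raise AssertionError('isinstance(orig, torch.nn.Module)')

# Training code would keep going even though the graph dump failed.
ok = add_graph_safe(FailingWriter(), model=None, dump_input=None)
print('continue training:', not ok)
```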