deep-high-resolution-net.pytorch

Error when training on MPII dataset

Open alex-razor opened this issue 4 years ago • 32 comments

I managed to run inference on several images successfully.

However, when I try to train again on MPII data using the example command line:

python tools/train.py \
    --cfg experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml \

I get the following error:

Traceback (most recent call last):
  File "/hdd/deep-high-resolution-net.pytorch/tools/train.py", line 223, in <module>
    main()
  File "/hdd/deep-high-resolution-net.pytorch/tools/train.py", line 111, in main
    writer_dict['writer'].add_graph(model, (dump_input, ))
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/tensorboardX/writer.py", line 738, in add_graph
    self._get_file_writer().add_graph(graph(model, input_to_model, verbose, **kwargs))
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 240, in graph
    trace = torch.jit.trace(model, args)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 772, in trace
    check_tolerance, _force_outplace, _module_class)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 898, in trace_module
    module = make_module(mod, _module_class, _compilation_unit)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 669, in make_module
    return _module_class(mod, _compilation_unit=_compilation_unit)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1881, in __init__
    self._modules[name] = TracedModule(submodule, id_set)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1386, in init_then_register
    original_init(self, *args, **kwargs)
  File "/hdd/deep-high-resolution-net.pytorch/venv/lib/python3.6/site-packages/torch/jit/__init__.py", line 1855, in __init__
    assert(isinstance(orig, torch.nn.Module))
AssertionError

alex-razor avatar Aug 21 '19 15:08 alex-razor

I have the same problem as you. Have you solved it?

ahanahaner avatar Aug 27 '19 09:08 ahanahaner

Same problem here.

tomahawk810 avatar Aug 28 '19 15:08 tomahawk810

same problem

TeeboneTing avatar Sep 05 '19 06:09 TeeboneTing

Reference: #98. Downgrading tensorboardX to version 1.6 fixes this, and everything works well.
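For anyone who cannot change package versions: both tracebacks in this thread fail inside the add_graph call in tools/train.py, which only logs the model graph to TensorBoard. A minimal sketch of one possible workaround is to make that call non-fatal. The helper name try_add_graph is hypothetical and is not part of the repository; the writer only needs an add_graph(model, inputs) method like the one shown in the traceback.

```python
import logging

logger = logging.getLogger(__name__)


def try_add_graph(writer, model, dummy_input):
    """Try to log the model graph; return True on success, False otherwise.

    Hypothetical helper, not part of the repository. `writer` is assumed to
    expose add_graph(model, inputs), as the tensorboardX writer does in the
    traceback above.
    """
    try:
        writer.add_graph(model, (dummy_input,))
        return True
    except Exception as exc:
        # Graph tracing raised (AssertionError / RuntimeError in this thread).
        # The graph is only a visualization, so log a warning and carry on.
        logger.warning("Skipping add_graph: %s: %s", type(exc).__name__, exc)
        return False
```

In train.py this would stand in for the direct writer_dict['writer'].add_graph(model, (dump_input, )) call. Training would then proceed without the graph view in TensorBoard; it does not fix the underlying tracing failure.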

TeeboneTing avatar Sep 05 '19 06:09 TeeboneTing

Which PyTorch version do you use? I downgraded tensorboardX to version 1.6, and another issue occurred, shown below: @TeeboneTing @leoxiaobin

=> init weights from normal distribution
=> loading pretrained model models/pytorch/imagenet/hrnet_w32-36af842e.pth
Traceback (most recent call last):
  File "tools/train.py", line 223, in <module>
    main()
  File "tools/train.py", line 111, in main
    writer_dict['writer'].add_graph(model, (dump_input, ))
  File "/home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/tensorboardX/writer.py", line 566, in add_graph
    self.file_writer.add_graph(graph(model, input_to_model, verbose))
  File "/home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 235, in graph
    _optimize_trace(trace, torch.onnx.utils.OperatorExportTypes.ONNX)
  File "/home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 175, in _optimize_trace
    trace.set_graph(_optimize_graph(trace.graph(), operator_export_type))
  File "/home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/tensorboardX/pytorch_graph.py", line 201, in _optimize_graph
    torch._C._jit_pass_lower_all_tuples(graph)
RuntimeError: kind.is_prim() INTERNAL ASSERT FAILED at /pytorch/torch/csrc/jit/ir.cpp:904, please report a bug to PyTorch. Only prim ops are allowed to not have a registered operator but aten::_convolution doesn't have one either. We don't know if this op has side effects.
(hasSideEffects at /pytorch/torch/csrc/jit/ir.cpp:904)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7fd2f6c54273 in /home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::jit::Node::hasSideEffects() const + 0x2d5 (0x7fd215f162c5 in /home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: <unknown function> + 0x3ca9923 (0x7fd215f7c923 in /home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x3ca9a75 (0x7fd215f7ca75 in /home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #4: torch::jit::EliminateDeadCode(torch::jit::Block*, bool, torch::jit::DCESideEffectPolicy) + 0x138 (0x7fd215f798d8 in /home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #5: torch::jit::LowerAllTuples(std::shared_ptr<torch::jit::Graph>&) + 0x29 (0x7fd215f9c3f9 in /home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #6: <unknown function> + 0x454cbb (0x7fd2fbcd0cbb in /home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0x1d3ef4 (0x7fd2fba4fef4 in /home/jiapy/virtualEnv/py3torch1.2/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #8: python() [0x50746c]
frame #10: python() [0x5057d7]
frame #11: python() [0x506ac3]
frame #12: python() [0x507330]
frame #14: python() [0x5057d7]
frame #15: python() [0x506ac3]
frame #16: python() [0x507330]
frame #18: python() [0x5057d7]
frame #19: python() [0x506ac3]
frame #20: python() [0x507330]
frame #22: python() [0x504e80]
frame #23: python() [0x506ac3]
frame #24: python() [0x507330]
frame #26: python() [0x5064e4]
frame #27: python() [0x507330]
frame #29: python() [0x504e80]
frame #31: python() [0x646a12]
frame #36: __libc_start_main + 0xf0 (0x7fd300ee6830 in /lib/x86_64-linux-gnu/libc.so.6)

dzyjjpy avatar Oct 11 '19 09:10 dzyjjpy

I use pytorch==1.2 and torchvision==0.4. Does anyone have a similar experience? There is no problem when running "python tools/test.py" on COCO or MPII.

dzyjjpy avatar Oct 11 '19 09:10 dzyjjpy

it works when torch==1.0, tensorboardX==1.6
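A quick way to compare your own environment against this combination is sketched below. It only encodes what is reported here (torch 1.0.x with tensorboardX 1.6); the function names are made up for illustration, and other combinations may well work too.

```python
import re


def major_minor(version):
    """Return (major, minor) from a version string such as '1.0.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % (version,))
    return int(match.group(1)), int(match.group(2))


def is_reported_working(torch_version, tensorboardx_version):
    """True only for the combination reported to work in this thread.

    Assumption: patch releases of torch 1.0 behave the same way. This is
    not an official compatibility table.
    """
    return (major_minor(torch_version) == (1, 0)
            and major_minor(tensorboardx_version) == (1, 6))


print(is_reported_working("1.0.1", "1.6"))  # matches the reported combination
print(is_reported_working("1.2.0", "1.6"))  # the combination that failed above
```

The version strings can be read from "pip show torch" and "pip show tensorboardX".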

dzyjjpy avatar Oct 12 '19 01:10 dzyjjpy

@dzyjjpy My version: torch==1.0.0 torchvision==0.2.1 tensorboardX==1.6

TeeboneTing avatar Oct 17 '19 07:10 TeeboneTing

@dzyjjpy Hello, can I ask you a question?

eng100200 avatar Oct 23 '19 07:10 eng100200

Thanks for this issue. My versions: pytorch == 1.2.0, torchvision == 0.4.0, tensorboardX == 1.6. It works, but other bugs appear. 😂

ZP-Guo avatar Oct 28 '19 03:10 ZP-Guo

@TeeboneTing thanks. @eng100200 pls describe your issue.

dzyjjpy avatar Oct 28 '19 03:10 dzyjjpy

@GZP123 have you run the test code on mpii for w32?

eng100200 avatar Oct 28 '19 04:10 eng100200

I have not. I have a bug similar to @dzyjjpy's 😂 and cannot run it yet. Let me keep trying.

ZP-Guo avatar Oct 28 '19 04:10 ZP-Guo

Now my versions are: pytorch == 1.0.1, torchvision == 0.2.1, tensorboardX == 1.6, and I can run the training code. I will need a few more days before I get to the test code.

ZP-Guo avatar Oct 28 '19 04:10 ZP-Guo

@dzyjjpy Hello, I want to ask a few questions:

  1. Why do we flip images when testing on the MPII dataset?
  2. I think MPII provides bounding boxes, but I am not sure. How did you get the bounding boxes for MPII, and how did you compute center and scale?
  3. For the MPII validation set used for testing, do you do single-person or multi-person detection?
  4. How can I display the images with the detected pose points drawn on them?
  5. How can I test an image of my own?
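
On question 1, my understanding (an assumption, not confirmed by the authors in this thread) is that flipping is test-time augmentation: the model is also run on a mirrored copy of the image, the resulting heatmaps are mirrored back, left and right joints are swapped, and the two predictions are averaged. A minimal pure-Python sketch of that idea follows. The function names and the joint pairs are illustrative (the pairs follow the usual MPII 16-joint ordering), and this is not the repository's code.

```python
# Left/right joint index pairs, assuming the usual MPII 16-joint ordering
# (0-5 legs, 10-15 arms). Check these against your own annotation format.
FLIP_PAIRS = [(0, 5), (1, 4), (2, 3), (10, 15), (11, 14), (12, 13)]


def flip_back(heatmaps, pairs=FLIP_PAIRS):
    """Undo a horizontal flip on a stack of per-joint heatmaps.

    heatmaps: list of J heatmaps, each a list of rows (lists of floats).
    Returns a new stack; the input is left unchanged.
    """
    # Mirror every row so pixel columns line up with the original image.
    mirrored = [[row[::-1] for row in hm] for hm in heatmaps]
    # A left wrist in the flipped image is a right wrist in the original,
    # so swap the channels of each left/right pair.
    for a, b in pairs:
        mirrored[a], mirrored[b] = mirrored[b], mirrored[a]
    return mirrored


def average_heatmaps(first, second):
    """Element-wise mean of two heatmap stacks with identical shapes."""
    return [
        [[(x + y) / 2 for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]
        for m1, m2 in zip(first, second)
    ]
```

In use, the final prediction would be average_heatmaps(out, flip_back(out_flipped)), where out and out_flipped are the model outputs for the original and the mirrored image.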

eng100200 avatar Oct 28 '19 06:10 eng100200

@GZP123 did you train using mpii?

eng100200 avatar Oct 28 '19 06:10 eng100200

I did, and the speed is good enough. But I cannot train on MS COCO because of my GPUs. I have Ubuntu 16.0 and 2 GTX 1080 GPUs. Maybe I need to modify the batch size if I want to train on MS COCO.

ZP-Guo avatar Oct 29 '19 02:10 ZP-Guo

@ZP-Guo Thanks for the reply. I am also planning to train on MPII, but with a compact model of less than 8 GFLOPs, and my target is to train for sixteen keypoints only. Can I ask you one question: why is the scale multiplied by 1.25 when the center is not -1? if center[0] != -1: scale = scale * 1.25
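
To make the question concrete, here is a small sketch of what that line does numerically. It assumes the MPII convention that scale is the person size in units of 200 pixels. The function names are illustrative and not taken from the repository, and reading the 1.25 factor as extra margin around the person is my guess, not something confirmed in this thread.

```python
PIXEL_STD = 200.0  # assumed: MPII "scale" is person size in units of 200 px
MARGIN = 1.25      # the factor from the line quoted above


def padded_scale(center, scale, margin=MARGIN):
    """Enlarge scale by `margin` unless the center is the -1 placeholder."""
    if center[0] != -1:
        return scale * margin
    return scale


def crop_side_px(scale, pixel_std=PIXEL_STD):
    """Side length in pixels of the square crop implied by `scale`."""
    return scale * pixel_std


# A person annotated with scale 2.0 (about 400 px) gets a 500 px crop.
print(crop_side_px(padded_scale((100.0, 200.0), 2.0)))
```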

eng100200 avatar Oct 29 '19 08:10 eng100200

Sorry, I cannot give you an answer because I have not paid much attention to the code details. I think the author often sets some values empirically, so I cannot explain it. For example, HRNet is also trained on face datasets; it can handle many tasks, such as human pose estimation and face detection. Someone asked in an issue about the face task how to get the "center" and "scale" values, and I found that the author set a value of "200" empirically. My friend found that issue and told me about it in a chat, so I did not save the link. What I want to say is that the author may have set some values empirically rather than deriving them by calculation. We may run into many problems we cannot solve right away, but do not give up and we will get there. Come on, bro. 😎

ZP-Guo avatar Oct 29 '19 12:10 ZP-Guo

@ZP-Guo I understand your answer. I will try to dig deeper, and if I find the answer I will share it with you. Note that I am using HRNet for pose estimation.

eng100200 avatar Oct 30 '19 01:10 eng100200

Maybe I do not understand CNNs well enough. For example, why does VGG-16 have 64, 128, and 256 channels? I think the authors set them empirically, and I cannot explain it by calculation. I also use CNNs for human pose estimation, like you. I would be grateful if you share what you find.

ZP-Guo avatar Oct 31 '19 03:10 ZP-Guo

@ZP-Guo Shake hands. I think the number of channels is just an empirical choice. There could be many reasons for it, such as memory cost, computation, and whether that number of channels carries enough information. Do you have WeChat?

eng100200 avatar Oct 31 '19 03:10 eng100200

I mean, why did they choose 64, 128, 256, and so on, where 128 = 64 * 2 and 256 = 128 * 2? Why not 61, 122, 244 or 55, 110, 220? But it does not matter. Maybe it is enough that I know how to use it. We can exchange experience via e-mail. You can find my e-mail address on my homepage. 😊

ZP-Guo avatar Nov 01 '19 01:11 ZP-Guo

@ZP-Guo @dzyjjpy @alex-razor Have you received the error "raise RuntimeError("Failed to export an ONNX attribute, " RuntimeError: Failed to export an ONNX attribute, since it's not constant, please try to make things (e.g., kernel size) static if possible" during training on the MPII dataset? I have installed pytorch=1.1.0 and tensorboardX=1.6. Should I upgrade tensorboardX?

eng100200 avatar Nov 01 '19 10:11 eng100200

Sorry, I have not run into this bug. Maybe you can update your environment to match my versions.

ZP-Guo avatar Nov 02 '19 02:11 ZP-Guo

Sorry, could you tell me the full list of versions used for this work? I want to train on the COCO dataset, but this error always appears: cpu_nms.py", line 12 cimport numpy as np

1037861070 avatar Nov 05 '19 02:11 1037861070

@eng100200 Hello, did you find answers to the five questions you asked above? I'm also curious.

Muhtasham avatar Mar 04 '20 09:03 Muhtasham

@dzyjjpy Do you mean flipping for testing? I can answer questions 2 to 5.

eng100200 avatar Mar 06 '20 02:03 eng100200

Can you please answer questions 2 to 5?

Muhtasham avatar Mar 06 '20 09:03 Muhtasham

None entries in an nn.ModuleList break the JIT in newer versions of PyTorch. See https://github.com/pytorch/pytorch/issues/30459 for the resolution.

yangsenius avatar Apr 28 '20 08:04 yangsenius