Deep_Object_Pose
Generating Belief Maps using train2/train.py
Hi, I am attempting to run the training script and generate the belief maps from train2/train.py in order to debug, but I am getting this error:
start: 18:18:30.781464
load data: ['/home/user/Downloads/Spanner2']
load data:
training data: 2000 batches
load models
ready to train!
Traceback (most recent call last):
  File "train.py", line 606, in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 17249) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2022-04-06_18:18:39
  host       : user-User
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 17249)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I am unsure what is causing this error, as I have the correct versions of PyTorch installed based on requirements.txt. Are there any common mistakes I could be making?
Could you share an example of a JSON file you are using in your dataset? It looks like, in
p = [point[numb_point][1], point[numb_point][0]]
point is empty or the dimensions are wrong (see the sketch below for a quick way to check). @mintar refactored the data format a little bit; I did not check whether it is compatible with train2/train.py, but I will try to check soon.
@BazUCD Did you try to use the original training script?
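In the meantime, a quick sanity check on the annotations can show whether projected_cuboid is actually populated. This is only a sketch: the directory is taken from the log above, and the field names assume the NVISII-style ground-truth layout, so adjust both to your data:

```python
import glob
import json

# Assumed data directory (from the log above) and NVISII-style field names.
for path in sorted(glob.glob("/home/user/Downloads/Spanner2/**/*.json", recursive=True))[:5]:
    with open(path) as f:
        data = json.load(f)
    for i, obj in enumerate(data.get("objects", [])):
        cuboid = obj.get("projected_cuboid", [])
        centroid = obj.get("projected_cuboid_centroid")
        print(
            f"{path} object {i}: class={obj.get('class')}, "
            f"{len(cuboid)} projected_cuboid points, "
            f"centroid present: {centroid is not None}"
        )
```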
Hi @TontonTremblay, thanks for the quick reply. Here's an example of my .json files, with the associated .png as well:
I've used the original training script and generated some weights, but I was unable to detect anything, so after your recommendation from #238 I have been trying to generate the belief maps using train2.
This looks correct, but your object has a symmetry in it. You should look into Martin's notes on handling objects with symmetries: https://github.com/NVlabs/Deep_Object_Pose/tree/master/scripts/nvisii_data_gen#handling-objects-with-symmetries
I encountered a similar issue. The training script expects the "projected_cuboid" field to contain 9 points, the last of which is the point stored under "projected_cuboid_centroid". In your case, you can add something like projected_cuboid_keypoints.append(obj['projected_cuboid_centroid']) right below line 228 in utils_dope.py. I did this and it worked for me.
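For reference, here is a minimal sketch of what that change amounts to. The helper name is hypothetical; in utils_dope.py the append simply goes onto the existing projected_cuboid_keypoints list once the object annotation has been read from the JSON:

```python
def cuboid_keypoints(obj):
    """Return the 9 keypoints the loader indexes: the 8 projected cuboid
    corners plus the projected centroid. `obj` is one entry of
    data["objects"] from a ground-truth JSON file. (Hypothetical helper,
    shown only to illustrate the fix.)"""
    keypoints = list(obj["projected_cuboid"])           # 8 corner points, [x, y] each
    keypoints.append(obj["projected_cuboid_centroid"])  # 9th point: the centroid
    return keypoints
```

Appending the centroid this way keeps the belief-map code's assumption of 9 keypoints (8 cuboid corners plus the centroid) without having to regenerate or edit the exported data.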