RelationalGraphLearning

train.py --policy cadrl problem

Open MandyZhang4869 opened this issue 1 year ago • 5 comments

Output directory already exists! Overwrite the folder? (y/n)y
2023-08-09 19:38:00, INFO: Current git head hash code: 8e87aa5ed8221efd688f8e6857ba4c38637bf6e1
2023-08-09 19:38:00, INFO: Current config content is :<module 'config' from 'data/output/config.py'>
2023-08-09 19:38:00, INFO: Using device: cpu
2023-08-09 19:38:00, INFO: Similarity_func: embedded_gaussian
2023-08-09 19:38:00, INFO: Layerwise_graph: False
2023-08-09 19:38:00, INFO: Skip_connection: True
2023-08-09 19:38:00, INFO: Number of layers: 2
2023-08-09 19:38:00, INFO: Similarity_func: embedded_gaussian
2023-08-09 19:38:00, INFO: Layerwise_graph: False
2023-08-09 19:38:00, INFO: Skip_connection: True
2023-08-09 19:38:00, INFO: Number of layers: 2
2023-08-09 19:38:00, INFO: Planning depth: 1
2023-08-09 19:38:00, INFO: Planning width: 1
2023-08-09 19:38:00, INFO: Sparse search: None
2023-08-09 19:38:00, INFO: human number: 5
2023-08-09 19:38:00, INFO: Not randomize human's radius and preferred speed
2023-08-09 19:38:00, INFO: Training simulation: circle_crossing, test simulation: circle_crossing
2023-08-09 19:38:00, INFO: Square width: 20, circle width: 4
2023-08-09 19:38:00, INFO: Lr: 0.001 for parameters graph_model.w_a graph_model.w_r.0.weight graph_model.w_r.0.bias graph_model.w_r.2.weight graph_model.w_r.2.bias graph_model.w_h.0.weight graph_model.w_h.0.bias graph_model.w_h.2.weight graph_model.w_h.2.bias graph_model.Ws.0 graph_model.Ws.1 value_network.0.weight value_network.0.bias value_network.2.weight value_network.2.bias value_network.4.weight value_network.4.bias value_network.6.weight value_network.6.bias graph_model.w_a graph_model.w_r.0.weight graph_model.w_r.0.bias graph_model.w_r.2.weight graph_model.w_r.2.bias graph_model.w_h.0.weight graph_model.w_h.0.bias graph_model.w_h.2.weight graph_model.w_h.2.bias graph_model.Ws.0 graph_model.Ws.1 human_motion_predictor.0.weight human_motion_predictor.0.bias human_motion_predictor.2.weight human_motion_predictor.2.bias with Adam optimizer
  0%| | 0/2000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 268, in <module>
    main(sys_args)
  File "train.py", line 168, in main
    explorer.run_k_episodes(il_episodes, 'train', update_memory=True, imitation_learning=True)
  File "/RelationalGraphLearning/crowd_nav/utils/explorer.py", line 43, in run_k_episodes
    ob = self.env.reset(phase)
TypeError: reset() takes 1 positional argument but 2 were given
  0%| | 0/2000 [00:00<?, ?it/s]

I am sorry to bother you, but I couldn't understand or solve this problem...

MandyZhang4869 avatar Aug 09 '23 11:08 MandyZhang4869

@MandyZhang4869 please check this link: https://github.com/vita-epfl/CrowdNav/issues/45

It looks like you need to downgrade the gym version.

ChanganVR avatar Aug 12 '23 20:08 ChanganVR

Hmm, what about this one? This is the error I encountered while running RGL. (I am so sorry to bother you again.)

2023-08-16 19:13:34, INFO: Lr: 0.001 for parameters graph_model.w_a graph_model.w_r.0.weight graph_model.w_r.0.bias graph_model.w_r.2.weight graph_model.w_r.2.bias graph_model.w_h.0.weight graph_model.w_h.0.bias graph_model.w_h.2.weight graph_model.w_h.2.bias graph_model.Ws.0 graph_model.Ws.1 value_network.0.weight value_network.0.bias value_network.2.weight value_network.2.bias value_network.4.weight value_network.4.bias value_network.6.weight value_network.6.bias graph_model.w_a graph_model.w_r.0.weight graph_model.w_r.0.bias graph_model.w_r.2.weight graph_model.w_r.2.bias graph_model.w_h.0.weight graph_model.w_h.0.bias graph_model.w_h.2.weight graph_model.w_h.2.bias graph_model.Ws.0 graph_model.Ws.1 human_motion_predictor.0.weight human_motion_predictor.0.bias human_motion_predictor.2.weight human_motion_predictor.2.bias with Adam optimizer
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████▉| 1998/2000 [00:38<00:00, 45.84it/s]
2023-08-16 19:14:13, INFO: TRAIN has success rate: 0.89, collision rate: 0.09, nav time: 12.23, total reward: 0.2389, average return: 0.4869
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:38<00:00, 52.21it/s]
Traceback (most recent call last):
  File "train.py", line 268, in <module>
    main(sys_args)
  File "train.py", line 169, in main
    trainer.optimize_epoch(il_epochs)
  File "/RelationalGraphLearning/crowd_nav/utils/trainer.py", line 83, in optimize_epoch
    loss.backward()
  File "/root/anaconda3/envs/new_drl37/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/root/anaconda3/envs/new_drl37/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [100, 6, 32]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

MandyZhang4869 avatar Aug 16 '23 11:08 MandyZhang4869

OK, I downgraded the torch version.

MandyZhang4869 avatar Aug 16 '23 13:08 MandyZhang4869

Using Python 3.7.17, I had to downgrade Gym to 0.22.0, Torch to 1.9.0 and Torchvision to 0.10.0. Now everything seems to work fine.

I also modified the following:

  • line 43 in crowd_nav/utils/explorer.py from "ob = self.env.reset(phase)" to "ob = self.env.reset(phase=phase)".
  • line 110 in crowd_nav/test.py from "ob = env.reset(args.phase, args.test_case)" to "ob = env.reset(phase=args.phase, test_case=args.test_case)".
  • line 125 in crowd_nav/test.py from "env.render('traj', args.video_file)" to "env.render(mode='traj', output_file=args.video_file)".
  • line 133 in crowd_nav/test.py from "env.render('video', args.video_file)" to "env.render(mode='video', output_file=args.video_file)".
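
Why the keyword form helps can be sketched without gym at all. The snippet below is only an illustration under an assumption: that the environment ends up behind a wrapper whose `reset()` forwards keyword arguments only, which would be consistent with the `TypeError` in the first traceback. `CrowdSimLike` and `KeywordOnlyWrapper` are hypothetical stand-ins, not classes from gym or from this repository.

```python
class CrowdSimLike:
    """Hypothetical stand-in for the simulation environment."""

    def reset(self, phase="test", test_case=None):
        # Return the arguments so the caller can see what arrived.
        return {"phase": phase, "test_case": test_case}


class KeywordOnlyWrapper:
    """Hypothetical wrapper whose reset() accepts keyword arguments only."""

    def __init__(self, env):
        self.env = env

    def reset(self, **kwargs):
        # A positional argument never reaches this body: Python rejects the call.
        return self.env.reset(**kwargs)


env = KeywordOnlyWrapper(CrowdSimLike())

# Positional call, as in the original explorer.py line: rejected by the wrapper.
try:
    env.reset("train")
    positional_error = None
except TypeError as exc:
    positional_error = str(exc)

# Keyword call, as in the modified line: forwarded to the wrapped environment.
ob = env.reset(phase="train")
```

With this setup the positional call fails with the same kind of message as in the traceback, while the keyword call goes through unchanged.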

TommasoVandermeer avatar Feb 12 '24 10:02 TommasoVandermeer

Additionally, if you don't want to change the PyTorch version, just modify https://github.com/ChanganVR/RelationalGraphLearning/blob/8e87aa5ed8221efd688f8e6857ba4c38637bf6e1/crowd_nav/policy/graph_model.py#L127 to "next_H = next_H + H". I hope @ChanganVR can correct it.
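
For anyone wondering why this one-line change matters, here is a minimal sketch. It assumes the linked line adds the skip connection in place (something like `next_H += H`) directly on a ReLU output, which is what the "output 0 of ReluBackward0" message suggests. The function names and tensor sizes are made up for illustration and are not taken from graph_model.py.

```python
import torch


def graph_layer(H, W, in_place):
    # ReLU output; autograd may keep this tensor for the backward pass.
    next_H = torch.relu(torch.matmul(H, W))
    if in_place:
        # Overwrites the ReLU output that autograd may still need.
        next_H += H
    else:
        # Allocates a new tensor and leaves the ReLU output untouched.
        next_H = next_H + H
    return next_H


def backward_ok(in_place):
    """Return True if the backward pass runs and produces a gradient for W."""
    torch.manual_seed(0)
    H = torch.randn(4, 6, 32)
    W = torch.randn(32, 32, requires_grad=True)
    try:
        graph_layer(H, W, in_place).sum().backward()
    except RuntimeError:
        return False
    return W.grad is not None
```

`backward_ok(in_place=False)` should succeed on any PyTorch version. Whether `backward_ok(in_place=True)` fails depends on the version, which would explain why downgrading torch also made the error go away.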

haofuly avatar Apr 05 '24 13:04 haofuly