SeqDex

"Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation" and have encountered an issue during training.

jaxonXu98 opened this issue Jan 06 '25 · 3 comments

  1. Memory Error

When running the following command:

python train_rlgames.py --task=BlockAssemblyOrient --num_envs=1024

I encountered a memory error, with the following traceback:

Traceback (most recent call last):
  File "train_rlgames.py", line 102, in <module>
    runner.run(vargs)
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/torch_runner.py", line 120, in run
    self.run_train(args)
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/torch_runner.py", line 101, in run_train
    agent.train()
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1162, in train
    self.obs = self.env_reset()
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 470, in env_reset
    obs = self.vec_env.reset()
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/hand_base/vec_task_rlgames.py", line 183, in reset
    self.task.step(actions)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/hand_base/base_task.py", line 135, in step
    self.pre_physics_step(actions)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1712, in pre_physics_step
    self.reset_idx(env_ids, goal_env_ids)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1607, in reset_idx
    self.post_reset(env_ids, hand_indices, object_indices, rand_floats)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1664, in post_reset
    pos_err = self.segmentation_target_init_pos - self.rigid_body_states[:, self.hand_base_rigid_body_index, 0:3]
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I am currently using a single NVIDIA 4090 GPU. Could you please let me know how many GPUs (and which model) you used in your experiments? This will help me determine whether the issue is related to hardware limitations.
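For reference, illegal memory access errors from CUDA are reported asynchronously, so the line shown in the traceback is not necessarily the one that faults. A minimal sketch, assuming a standard PyTorch/Isaac Gym setup, of forcing synchronous kernel launches to get a more precise traceback:

    import os

    # Placed at the very top of train_rlgames.py, before torch/isaacgym create a
    # CUDA context, this makes kernel launches synchronous so the traceback points
    # at the op that actually fails. Setting the variable in the shell before
    # running the training command has the same effect.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"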

When I reduce num_envs to 64 and run the following command:

python train_rlgames.py --task=BlockAssemblyOrient --num_envs=64

I encounter another issue, with the following traceback:

Traceback (most recent call last):
  File "train_rlgames.py", line 102, in <module>
    runner.run(vargs)
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/torch_runner.py", line 120, in run
    self.run_train(args)
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/torch_runner.py", line 101, in run_train
    agent.train()
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1173, in train
    step_time, play_time, update_time, sum_time, a_losses, c_losses, b_losses, entropies, kls, last_lr, lr_mul = self.train_epoch()
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 1037, in train_epoch
    batch_dict = self.play_steps()
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 636, in play_steps
    self.obs, rewards, self.dones, infos = self.env_step(res_dict['actions'])
  File "/home/jaho/anaconda3/envs/seqdex/lib/python3.8/site-packages/rl_games/common/a2c_common.py", line 458, in env_step
    obs, rewards, dones, infos = self.vec_env.step(actions)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/hand_base/vec_task_rlgames.py", line 168, in step
    self.task.step(actions_tensor)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/hand_base/base_task.py", line 135, in step
    self.pre_physics_step(actions)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1712, in pre_physics_step
    self.reset_idx(env_ids, goal_env_ids)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1467, in reset_idx
    self.saved_searching_ternimal_state = self.root_state_tensor.clone()[self.lego_indices.view(-1), :].view(self.num_envs, 108, 13)
RuntimeError: shape '[64, 108, 13]' is invalid for input of size 109824

This error seems related to a shape mismatch after reducing num_envs to 64.
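A quick arithmetic check (illustrative only, not SeqDex code) shows what per-env count the reported flat size actually corresponds to:

    num_envs = 64
    state_dim = 13            # each root state entry holds 13 values
    total_elements = 109824   # size reported in the RuntimeError

    per_env = total_elements // (num_envs * state_dim)
    print(per_env)            # 132, i.e. 132 states per env rather than the hard-coded 108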

  2. PyTorch Version

I would also like to confirm the version of PyTorch you used for this project. I want to make sure that I am using the correct version to avoid any compatibility issues.

  3. Program Stopping After Running main_rlgames("BlockAssemblySearch", 128)

When I run the following command:

python scripts/bi-optimization.py --task=BlockAssembly

The program executes only the first line:

search_policy_path = main_rlgames("BlockAssemblySearch", 128)

However, the subsequent lines do not run:

orient_policy_path = main_rlgames("BlockAssemblyOrient", 512)
grasp_sim_policy_path = main_rlgames("BlockAssemblyGraspSim", 512)
insert_sim_policy_path = main_rlgames("BlockAssemblyInsertSim", 512)
main_rlgames("BlockAssemblyInsertSim", 512, use_t_value=True, policy_path=insert_sim_policy_path)
transition_value_trainer("BlockAssemblyInsertSim", rollout=10000)
main_rlgames("BlockAssemblyGraspSim", 512, use_t_value=True, policy_path=grasp_sim_policy_path)
transition_value_trainer("BlockAssemblyGraspSim", rollout=10000)
main_rlgames("BlockAssemblyOrient", 128, use_t_value=True, policy_path=orient_policy_path)
transition_value_trainer("BlockAssemblyOrient", rollout=10000)

If I comment out the line search_policy_path = main_rlgames("BlockAssemblySearch", 128) after running it, essentially starting from orient_policy_path = main_rlgames("BlockAssemblyOrient", 512), I still encounter a memory error.
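For what it is worth, a small amount of instrumentation makes it easy to see whether the first stage returned normally, raised an exception, or silently killed the process; the run_stage helper below is hypothetical and not part of bi-optimization.py:

    import traceback

    def run_stage(label, fn, *args, **kwargs):
        """Run one pipeline stage and make its outcome visible in the log."""
        print(f"[pipeline] starting {label}", flush=True)
        try:
            result = fn(*args, **kwargs)
        except Exception:
            print(f"[pipeline] {label} raised:", flush=True)
            traceback.print_exc()
            raise
        print(f"[pipeline] finished {label} -> {result}", flush=True)
        return result

    # e.g. search_policy_path = run_stage("search", main_rlgames, "BlockAssemblySearch", 128)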

jaxonXu98 · Jan 06 '25

Hello, I also encountered a memory issue. Have you managed to solve it?

pp5201314 · May 06 '25

@j96w Yes, I also encounter the first memory error even with an A100, so perhaps there is an error in the release code?

Ralph-cong · Jul 08 '25

Hi, I found that the problem is triggered by running out of GPU memory:

PxgCudaDeviceMemoryAllocator fail to allocate memory 67108864 bytes!! Result = 2

It appears to be related to the aggregate limits passed to self.gym.begin_aggregate(env_ptr, max_agg_bodies, max_agg_shapes, True). I modified max_agg_bodies and max_agg_shapes to be consistent with the search task, and it works:

        max_agg_bodies = 174
        max_agg_shapes = 271
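
For context, these two values bound how many rigid bodies and collision shapes can be created between begin_aggregate and end_aggregate in each env, so they have to cover the hand, the table, and all of the blocks. A minimal sketch of the pattern (the surrounding sim/env setup is illustrative, not SeqDex's actual env-creation code):

    from isaacgym import gymapi

    gym = gymapi.acquire_gym()
    sim = gym.create_sim(0, 0, gymapi.SIM_PHYSX, gymapi.SimParams())
    env_ptr = gym.create_env(sim, gymapi.Vec3(-1, 0, -1), gymapi.Vec3(1, 1, 1), 1)

    max_agg_bodies = 174   # >= total rigid bodies of all actors created inside the aggregate
    max_agg_shapes = 271   # >= total collision shapes of those actors

    gym.begin_aggregate(env_ptr, max_agg_bodies, max_agg_shapes, True)
    # ... create the hand, table, and block actors for this env here ...
    gym.end_aggregate(env_ptr)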

By the way, the 108 should be changed to 132, which is the number of blocks; the search task already uses 132. That is why you encounter the problem in #9, RuntimeError: shape '[64, 108, 13]' is invalid for input of size 109824: 64 × 132 × 13 = 109824. The first problem above is probably also related to this, because max_agg_bodies depends on the number of blocks.
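A runnable toy version of that reshape (dummy tensors, not the repo's code) confirms that 132 blocks per env matches the reported size:

    import torch

    num_envs, num_blocks, state_dim = 64, 132, 13
    root_state_tensor = torch.zeros(num_envs * num_blocks, state_dim)
    lego_indices = torch.arange(num_envs * num_blocks)

    # .view(num_envs, 108, 13) raises the error above; 132 matches the 109824 elements.
    saved_state = root_state_tensor.clone()[lego_indices.view(-1), :].view(num_envs, 132, 13)
    print(saved_state.shape)  # torch.Size([64, 132, 13])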

Ralph-cong · Jul 11 '25