"Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation" and have encountered an issue during training.
- Memory Error
When running the following command:

```
python train_rlgames.py --task=BlockAssemblyOrient --num_envs=1024
```

I encountered a memory error, with the following (truncated) traceback:

```
Traceback (most recent call last):
  File "train_rlgames.py", line 102, in
...
TORCH_USE_CUDA_DSA to enable device-side assertions.
```
I am currently using a single NVIDIA 4090 GPU. Could you please let me know how many GPUs (and which model) you used in your experiments? This will help me determine if the issue is related to hardware limitations.
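For reference, here is the quick check I use to see which GPU the process sees and how much memory is free before launching. It is a minimal sketch that only needs PyTorch (a reasonably recent version, since `torch.cuda.mem_get_info` is not available in older releases) and assumes the training GPU is CUDA device 0:

```python
import torch

# Minimal check of the local GPU and its free/total memory (assumes CUDA device 0).
props = torch.cuda.get_device_properties(0)
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"GPU: {props.name}")
print(f"Memory: {free_bytes / 1024**3:.1f} GiB free / {total_bytes / 1024**3:.1f} GiB total")
```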
When I reduce num_envs to 64 and run the following command:

```
python train_rlgames.py --task=BlockAssemblyOrient --num_envs=64
```

I encounter another issue, with the following (truncated) traceback:

```
Traceback (most recent call last):
  File "train_rlgames.py", line 102, in
```
- PyTorch Version
I would also like to confirm the version of PyTorch you used for this project. I want to make sure that I am using the correct version to avoid any compatibility issues.
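For reference, this is what I run locally to record the versions I am comparing against (plain PyTorch calls, nothing repo-specific):

```python
import torch

# Print the local PyTorch / CUDA / cuDNN versions to compare against the repo's requirements.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
```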
- Program Stopping After Running main_rlgames("BlockAssemblySearch", 128)
When I run the following command:

```
python scripts/bi-optimization.py --task=BlockAssembly
```

the program executes only the first line:

```python
search_policy_path = main_rlgames("BlockAssemblySearch", 128)
```

However, the subsequent lines do not run:

```python
orient_policy_path = main_rlgames("BlockAssemblyOrient", 512)
grasp_sim_policy_path = main_rlgames("BlockAssemblyGraspSim", 512)
insert_sim_policy_path = main_rlgames("BlockAssemblyInsertSim", 512)
main_rlgames("BlockAssemblyInsertSim", 512, use_t_value=True, policy_path=insert_sim_policy_path)
transition_value_trainer("BlockAssemblyInsertSim", rollout=10000)
main_rlgames("BlockAssemblyGraspSim", 512, use_t_value=True, policy_path=grasp_sim_policy_path)
transition_value_trainer("BlockAssemblyGraspSim", rollout=10000)
main_rlgames("BlockAssemblyOrient", 128, use_t_value=True, policy_path=orient_policy_path)
transition_value_trainer("BlockAssemblyOrient", rollout=10000)
```

If I comment out `search_policy_path = main_rlgames("BlockAssemblySearch", 128)` after it has already run, so that execution effectively starts from `orient_policy_path = main_rlgames("BlockAssemblyOrient", 512)`, I still encounter a memory error.
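I have not found the root cause, but as a stopgap I have been experimenting with running each stage in its own process, so that the simulator's GPU memory is fully freed before the next stage starts. This is only a rough sketch under that assumption: the wrapper below is my own, not part of the repo, and it skips the `use_t_value` / `transition_value_trainer` steps.

```python
import subprocess
import sys

# Hypothetical stage-by-stage driver: each training stage runs in a fresh process,
# so all of Isaac Gym's GPU allocations are released when the stage's process exits.
STAGES = [
    ("BlockAssemblySearch", 128),
    ("BlockAssemblyOrient", 512),
    ("BlockAssemblyGraspSim", 512),
    ("BlockAssemblyInsertSim", 512),
]

for task, num_envs in STAGES:
    cmd = [sys.executable, "train_rlgames.py", f"--task={task}", f"--num_envs={num_envs}"]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop early if a stage fails
```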
Hello, I also encountered a memory issue. Have you solved it yet?
```
    reset
    self.task.step(actions)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/hand_base/base_task.py", line 135, in step
    self.pre_physics_step(actions)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1712, in pre_physics_step
    self.reset_idx(env_ids, goal_env_ids)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1607, in reset_idx
    self.post_reset(env_ids, hand_indices, object_indices, rand_floats)
  File "/home/jaho/pythonProject/SeqDex-master/SeqDex/dexteroushandenvs/tasks/block_assembly/allegro_hand_block_assembly_orient.py", line 1664, in post_reset
    pos_err = self.segmentation_target_init_pos - self.rigid_body_states[:, self.hand_base_rigid_body_index, 0:3]
RuntimeError: CUDA error: an illegal memory access was encountered
```
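In case it helps narrow this down, a small sanity check just before the failing line in `post_reset()` can show whether the rigid-body index is out of range for the reshaped state tensor. This is a debugging sketch only; `torch` is already imported in that module, and the attribute names are taken from the traceback above. Note that CUDA errors are reported asynchronously, so the real fault may also be in an earlier launch.

```python
# Debugging sketch, inserted right before the failing subtraction in post_reset():
# print the shapes involved and fail early with a readable message instead of an
# opaque CUDA illegal-memory-access further down the line.
print("rigid_body_states:", tuple(self.rigid_body_states.shape))
print("segmentation_target_init_pos:", tuple(self.segmentation_target_init_pos.shape))
print("hand_base_rigid_body_index:", self.hand_base_rigid_body_index)
assert int(torch.as_tensor(self.hand_base_rigid_body_index).max()) < self.rigid_body_states.shape[1], \
    "hand_base_rigid_body_index is out of range for rigid_body_states"
```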
@j96w Yes, I also encounter the first memory error even with an A100, so maybe there is an error in the released code?
hi~ I found that the problem is triggered by running out of GPU (display) memory:

```
PxgCudaDeviceMemoryAllocator fail to allocate memory 67108864 bytes!! Result = 2
```

It seems to be related to the aggregate sizes passed to `self.gym.begin_aggregate(env_ptr, max_agg_bodies, max_agg_shapes, True)`.
So I modified `max_agg_bodies` and `max_agg_shapes` to be consistent with the search task, and it works:

```python
max_agg_bodies = 174
max_agg_shapes = 271
```
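For context, these two values are the aggregate capacity reserved per env when the actors are created. A rough, simplified sketch of the pattern around the `begin_aggregate` call quoted above (not the exact repo code):

```python
# Inside the orient task's env-creation loop (simplified sketch, not the exact repo code):
# begin_aggregate reserves room for at most max_agg_bodies rigid bodies and
# max_agg_shapes collision shapes per env; in this thread, undersized values appear
# to lead to the PhysX GPU allocation failure shown above.
max_agg_bodies = 174   # bumped to match the search task
max_agg_shapes = 271   # bumped to match the search task

self.gym.begin_aggregate(env_ptr, max_agg_bodies, max_agg_shapes, True)
# ... create the hand, table, and block actors for this env here ...
self.gym.end_aggregate(env_ptr)
```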
By the way, the 108 should be changed to 132, which is the number of blocks; in the search task it is already 132. That is why you encounter the problem in #9, `RuntimeError: shape '[64, 108, 13]' is invalid for input of size 109824`: 109824 = 64 × 132 × 13, so the tensor only reshapes correctly with 132. The first memory problem above may also be related to this, because `max_agg_bodies` depends on the number of blocks.
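A quick standalone check of the reshape arithmetic (independent of the repo):

```python
import torch

# The error message reports 109824 elements, which factors as 64 envs * 132 bodies * 13 floats.
flat = torch.zeros(109824)

print(flat.view(64, 132, 13).shape)   # torch.Size([64, 132, 13]) -- works
try:
    flat.view(64, 108, 13)            # 64 * 108 * 13 == 89856, so this cannot match
except RuntimeError as e:
    print(e)                          # shape '[64, 108, 13]' is invalid for input of size 109824
```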