ManiSkill [Question] Solving Pick-Cube from Pixels Only

Hey! I wanted to see if you guys had any reference code / hyperparameters for SAC solving any of the tabletop tasks using RGB(D) data only and no proprioceptive state information. Thanks!

Oct 30 '24 20:10 SumeetBatra

Sorry, we have not tuned SAC at the moment, only PPO with some proprioception data + one RGB camera. There is some example code with state-based SAC; a simple vision-based one will come eventually. TD-MPC2 is already integrated and supports learning from pixels, and doesn't need much tuning.

If there's a lot of value in testing algorithms with visual-only inputs, we can try to help set it up in the future. We already have some DM Control environments benchmarked with PPO with an option to use visual-only inputs.

Oct 30 '24 21:10 StoneT2000

I see, thanks for letting me know! I think having some baselines of end-to-end pixel to action policies would be useful. I am currently using SAC for my project but may also try out other algos in the future.

Oct 30 '24 21:10 SumeetBatra

Is GPU parallelization important in your case? Or are you working more on e.g. sample efficiency? I can have some members of the team look into tuning an RGB/RGBD SAC version.

Oct 30 '24 23:10 StoneT2000

It's not important, but if it makes policy convergence faster I'm for GPU parallelization. Sample efficiency is not an issue atm. I appreciate you all looking into this!

Oct 30 '24 23:10 SumeetBatra

Hey! Just wanted to check in and see if this is in the pipeline and if so, if you guys have an expected release date on it. Thanks!

Nov 13 '24 07:11 SumeetBatra

Currently working on it! Fixing up the SAC state and RGBD implementations now. Will provide a baseline for PickCube and maybe a few other tasks.

Nov 15 '24 20:11 StoneT2000

Ok @SumeetBatra, new baseline uploaded. I only checked that it works for PushCube and PickCube from pixels. The suggested script to run

python sac_rgbd.py --env_id="PickCube-v1" --obs_mode="rgb" \
  --num_envs=32 --utd=0.5 --buffer_size=300_000 \
  --control-mode="pd_ee_delta_pos" --camera_width=64 --camera_height=64 \
  --total_timesteps=1_000_000 --eval_freq=10_000

was tested and converged after about 1-1.5 hours on a 4090. The SAC code can run faster if I add torch compile/cudagraphs support and add some shared memory optimization for observation storage but that will be done in the future.

https://github.com/user-attachments/assets/f04be6f4-1f7f-4519-9247-a22d3156a880

The tiny 64x64 image in each corner is what the policy sees. The policy also sees any relevant state information (like the goal position for the cube and the agent's joint positions).

See the SAC baseline readme: https://github.com/haosulab/ManiSkill/blob/main/examples/baselines/sac/README.md

I'm sure the other tasks work fine with just the same hyperparameters as PickCube training if trained long enough and appropriate controller is used.

Nov 16 '24 01:11 StoneT2000

@StoneT2000 Thank you so much!! I'll take a look and follow up if I have any questions.

Nov 22 '24 20:11 SumeetBatra

@StoneT2000 I had a chance to look over the sac_rgbd baseline and it looks like state information is in there by default. Is it possible to solve the task without having any proprioceptive state information, just rgb(d) observations only?

EDIT: For extra context, what I'm trying to avoid is needing a perception pipeline to estimate low dimensional state information when working on real hardware. If any state information is present, ideally it should come from somewhere else like inverse kinematics and not a noisy / brittle perception system. Now that I think about it, joint angles can come from IK, so maybe this solution avoids the need for a perception pipeline? I haven't worked with these systems before, so let me know if I'm misunderstanding something.

Nov 27 '24 22:11 SumeetBatra

Hi @SumeetBatra

So generally, when it comes to sim2real / real2sim or testing whether something might work in the real world at all, the state data that is accessible and quite accurate is:

joint positions / qpos values
tcp_pose / end-effector pose / link poses (tcp_pose is always one of the observation states given in PickCube). These poses are available in the real world via forward kinematics using the current joint positions.
anything else like "command" information. For example, in PickCube a goal_pos is given in the observations, which is an xyz position in 3D space.

By default we also give qvel values, but these require estimation and are harder to align between sim and real, so I would definitely just remove them (you usually don't need them to solve tasks, though they might help with sample efficiency at times).
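
A minimal sketch of the selection described above: keep qpos, tcp_pose and goal_pos, and leave qvel out. The flat dictionary layout and the key names are assumptions for illustration only; the real observation structure of an environment may be nested differently and should be checked before reusing this.

```python
import numpy as np

# Assumed key names for illustration; inspect your env's observation dict first.
SIM2REAL_SAFE_KEYS = ("qpos", "tcp_pose", "goal_pos")


def select_state(obs, keys=SIM2REAL_SAFE_KEYS):
    """Concatenate only the chosen entries into one flat vector per env."""
    parts = []
    for key in keys:
        value = np.asarray(obs[key], dtype=np.float32)
        # Flatten everything after the leading (num_envs) dimension.
        parts.append(value.reshape(value.shape[0], -1))
    return np.concatenate(parts, axis=1)


# Toy batch of 4 envs with made-up dimensions.
obs = {
    "qpos": np.zeros((4, 9)),
    "qvel": np.ones((4, 9)),      # present in the dict, but never selected
    "tcp_pose": np.zeros((4, 7)),
    "goal_pos": np.zeros((4, 3)),
}
state = select_state(obs)
print(state.shape)
```

The resulting vector would be fed to the policy alongside the image, so nothing in it depends on velocity estimation.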

If you plan to do sim2real you will need to make modifications to environments for transfer regardless. By default envs in ManiSkill unless stated otherwise are designed more for algorithm benchmarking.

Also, learning from images only is quite difficult, although maybe not impossible, assuming the goal information is in the images somewhere (for PickCube it is not, but for PegInsertionSide or StackCube it is in the perceived image data). It is best to always include the necessary goal information of the env, as well as qpos values and tcp poses if possible; otherwise learning is slower.

Nov 27 '24 23:11 StoneT2000

This is really helpful, thanks!

What kind of modifications are needed to facilitate sim2real transfer? I'm guessing DR in the form of state observation noise and maybe some physics randomization at a minimum? Anything else I'm missing? And is there some existing pipeline for facilitating sim2real transfer in the repo? FYI I'm not concerned with the sim2real perception gap atm, mostly with sim2real physics gap and unmodeled dynamics.

Nov 27 '24 23:11 SumeetBatra

Hard to say, our lab is still finishing up some basic reproducible sim2real experiments that we will have relatively ready to share in a month or two I think. It is led by @Xander-Hinrichsen at the moment, he can comment a bit more on his own real experiences.

At minimum:

object color randomization
green-screening a real world image as the background (works for static non-mobile robotics setups like a single arm)
observation noise for state related data like agent qpos
ensure your simulation controller behaves close to the real world controller. I'd recommend checking for each action you can take from some rest position in sim and real, verify the qpos of the robot in real and sim are very close and don't deviate. Our current recommendation that works decently well is to use pd_joint_target_delta_pos controllers, and to tune the real world controller to always try and achieve the joint target.

Then you can easily train an RGB-based policy in sim and deploy it directly in the real world, mostly for simpler tasks with reaching/pushing/pulling type behaviors. Picking a cube is still kind of hard without more advanced tricks. @Xander-Hinrichsen and I are investigating how to make this as simple as possible without resorting to collecting real world demonstrations or combining RL with imitation learning.
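
Two of the items in the list above, green-screening and observation noise, can be sketched with plain NumPy. This is an illustration under assumptions, not the lab's pipeline: the foreground mask is assumed to come from the simulator's segmentation output, and the function names and the noise scale are made up for the example.

```python
import numpy as np


def greenscreen(sim_rgb, foreground_mask, real_background):
    """Keep simulated pixels where the mask is True, real photo pixels elsewhere."""
    sim_rgb = np.asarray(sim_rgb)
    real_background = np.asarray(real_background)
    if sim_rgb.shape != real_background.shape:
        raise ValueError("background must match the render resolution")
    mask = np.asarray(foreground_mask, dtype=bool)[..., None]  # (H, W, 1)
    return np.where(mask, sim_rgb, real_background)


def add_qpos_noise(qpos, rng, std=0.01):
    """Add zero-mean Gaussian noise to joint positions (std is a placeholder)."""
    qpos = np.asarray(qpos, dtype=np.float64)
    return qpos + rng.normal(0.0, std, size=qpos.shape)


rng = np.random.default_rng(0)
sim = np.full((64, 64, 3), 200, dtype=np.uint8)         # stand-in for a sim render
background = np.zeros((64, 64, 3), dtype=np.uint8)      # stand-in for a real photo
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True                               # robot + cube pixels
image = greenscreen(sim, mask, background)
noisy_qpos = add_qpos_noise(np.zeros(9), rng)
```

The background photo would be taken from the real camera with the robot and objects out of view, which is why this only suits static setups.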

Dec 16 '24 22:12 StoneT2000

@Xander-Hinrichsen Wonder if you could comment on what you found works and if you have a pipeline you can share! cc @StoneT2000

Jan 07 '25 00:01 SumeetBatra

Yes, as Stone has commented, I plan to have my pipeline posted in about a month for the Kochv1.1 arm, and possibly the SO-100 arm if time permits. Both are "affordable" robot arms from lerobot, though the pipeline is built to be extended by arbitrary robot arms in the future.

The process is intended to reflect ManiSkill's fidelity, so only simple randomizations are used, and little to nothing is done about the sim2real physics gap and unmodeled dynamics.

Task agnostic:

per scene camera randomizations (position, lookat position, and rotation about the viewing direction)
per scene lighting direction
robot arm color

PickCube specific (for task example):

cube color
cube size
cube friction

With these simple randomizations, I've had fairly successful policies zero shot for simple tasks like pickcube, and grab cube, using RGB and qpos observations only, and using the greenscreening overlay Stone mentioned earlier.

Randomizations I have tried but found unnecessary thus far:

added noise to proprioception observations
small perturbations to robot arm
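
The randomizations listed as used above could be sampled once per scene along these lines. Every range below is a placeholder chosen for the example, not a value from the pipeline, and `SceneParams` / `sample_scene_params` are hypothetical names.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class SceneParams:
    cube_half_size: float
    cube_friction: float
    cube_rgb: np.ndarray
    camera_offset: np.ndarray      # perturbation added to the nominal camera position
    light_direction: np.ndarray    # unit vector


def sample_scene_params(rng):
    """Draw one set of randomized parameters for a single scene."""
    direction = rng.normal(size=3)
    direction[2] = -abs(direction[2]) - 0.1   # force a downward-pointing light
    direction /= np.linalg.norm(direction)
    return SceneParams(
        cube_half_size=float(rng.uniform(0.015, 0.025)),
        cube_friction=float(rng.uniform(0.3, 1.0)),
        cube_rgb=rng.uniform(0.0, 1.0, size=3),
        camera_offset=rng.uniform(-0.02, 0.02, size=3),
        light_direction=direction,
    )


rng = np.random.default_rng(0)
params = [sample_scene_params(rng) for _ in range(32)]   # one entry per parallel env
```

Sampling everything up front keeps the draw separate from the simulator calls, so the same parameters can be logged or replayed later.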

Jan 07 '25 00:01 Xander-Hinrichsen

Ok, this is good to know, thanks! I'm guessing you're also using pd_joint_target_delta_pos controller as Stone mentioned?

Jan 07 '25 00:01 SumeetBatra

Yes, and target_qpos along with the qpos are used within the observations as well, but not qvel

Jan 07 '25 00:01 Xander-Hinrichsen

Alright thanks, I'll give this a try as well

Jan 07 '25 00:01 SumeetBatra

@Xander-Hinrichsen quick question. How did you go about randomizing the cube sizes? I instantiate the cube object first and then, for each actor object, try to modify its rigid body and render body params (specifically half size), but this throws an error:

rigid_body_component.collision_shapes[0].half_size += size_noise
render_body_component.render_shapes[0].half_size += size_noise

Is there another way I should be randomizing the cube half_size parameter for each env?

EDIT: Could you also let me know what real-world image dataset you use for background randomizations? Thanks!

Jan 07 '25 02:01 SumeetBatra

A good example of per-scene randomization is in the _load_scene function of the PegInsertionSide task.
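
One reading of that pointer, sketched without any simulator dependency: sample the size for each env first, then build a separate cube per env with that size, instead of editing shapes after the object exists. `fake_build_cube` is a stand-in written for this example; in ManiSkill it would be replaced by whatever builder calls the referenced _load_scene uses, which should be checked in the task source.

```python
import numpy as np


def build_cubes_per_env(num_envs, build_cube, rng, base_half_size=0.02, max_noise=0.005):
    """Sample a half size per env, then build one cube per env with it."""
    half_sizes = base_half_size + rng.uniform(-max_noise, max_noise, size=num_envs)
    cubes = [
        build_cube(env_idx=i, half_size=float(h))
        for i, h in enumerate(half_sizes)
    ]
    return cubes, half_sizes


def fake_build_cube(env_idx, half_size):
    # Hypothetical stand-in for a simulator builder restricted to one sub-scene.
    return {"name": f"cube_{env_idx}", "scene_idxs": [env_idx], "half_size": half_size}


rng = np.random.default_rng(0)
cubes, half_sizes = build_cubes_per_env(4, fake_build_cube, rng)
for cube in cubes:
    print(cube["name"], round(cube["half_size"], 4))
```

The default size and noise range are placeholders; the point is only that the geometry is decided before the object is created.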

Jan 08 '25 06:01 Xander-Hinrichsen

Ok. Do you have a sim2real training script you could share, by any chance? It might be easier for me to see what's going on that way.

Jan 10 '25 19:01 SumeetBatra