
[Question] adam v0.4.0 PyTorch interface M, C, G calculation super slow?

Open briannnyee opened this issue 2 months ago • 4 comments

Hello devs,

I tried out adam v0.4.0 and found that calculating M, C, and G with the PyTorch interface is very slow, regardless of batch size.

self.kinDyn = KinDynComputationsBatch(self.asset_path, self.actuated_joint_names)

self.M = self.kinDyn.mass_matrix(w_H_b, joint_pos) # (batch_size, 6 + n_joints, 6 + n_joints)
# Mass matrix computation time: 0.21195363998413086 s

self.C = self.kinDyn.coriolis_term(w_H_b, joint_pos, base_vel, joint_vel) # (batch_size, 6 + n_joints)
# Coriolis term computation time: 0.3075296878814697 s

self.G = self.kinDyn.gravity_term(w_H_b, joint_pos) # (batch_size, 6 + n_joints)
# Gravity term computation time: 0.2587003707885742 s

All computations ran on an L4 GPU. Is this speed normal, or am I missing something?
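For what it's worth, this is the kind of timing harness I'd use to rule out async-launch artifacts (a generic sketch, not adam-specific: CUDA kernels launch asynchronously, so `torch.cuda.synchronize()` is needed before reading the clock, otherwise the numbers can be meaningless):

```python
import time
import torch

def time_fn(fn, *args, warmup=3, iters=10):
    """Average wall-clock time of fn(*args), synchronizing the GPU so
    asynchronously launched kernels are actually counted."""
    for _ in range(warmup):          # warm up caches / allocator / autotuning
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # drain kernels queued during warmup
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()     # wait for the timed kernels to finish
    return (time.perf_counter() - start) / iters

# example on whatever device is available
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.randn(256, 256, device=device)
print(f"matmul: {time_fn(torch.matmul, a, a) * 1e3:.3f} ms")
```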

Thanks in advance!

briannnyee avatar Oct 21 '25 18:10 briannnyee

Hi @briannnyee and thanks for the feedback! :)

What batch size are you using? We should also double-check that all the computations happen on the same device. You could save some time by directly calling the function that computes the bias force (the sum of the Coriolis and gravity terms), which does that computation in a single pass. Also, I don't know if you are computing the forward dynamics, but the aba function might be slightly faster.
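Concretely, the replacement would look something like this (a sketch only: the method is called `bias_force` in the non-batched KinDynComputations, and I'm assuming the batched class exposes the same signature):

```python
# instead of coriolis_term(...) + gravity_term(...), one tree traversal:
h = self.kinDyn.bias_force(w_H_b, joint_pos, base_vel, joint_vel)
# h ~ (batch_size, 6 + n_joints); equals C + G but computed once
```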

I would say that adam is not really optimized to be super fast, but with some tricks we might make it faster!

Giulero avatar Oct 29 '25 00:10 Giulero

Hello!

I tested batch_size=16 and 4096; the computation time is the same for both.

I am aware of both methods you mentioned, and yes, it makes sense to call them directly. Still, aba takes roughly the same amount of time (0.2xx s).

I understand that adam aims to be differentiable rather than super fast. Still, I'd argue that the long compute time keeps adam from being practically usable, since it dramatically increases training time, which works against the purpose of the library.

As the field increasingly explores combining model-based control with deep learning, it would help adam gain traction if it were both fast and differentiable.

Let me know if I am being too greedy here ;P. Do you have plans in the near future to improve speed?

briannnyee avatar Oct 29 '25 01:10 briannnyee

Hi! You’re not being greedy - just honest, and that’s good! ;) I'm aware that faster execution would make the library much more practical, and we do plan to work on performance improvements.

In the meantime, could you help me with a few quick checks to guide optimization?

  • Are all computations running entirely on the GPU?

  • What precision are you using - float32 or float64?

  • How many steps per second do you see in Isaac Sim, and how many for adam’s aba (i.e., 1 / fun_exec_time × batch_size)?

Sadly I won't be able to work on this actively in the next few days, but your checks would be super helpful for profiling and speeding things up. Thanks a lot for taking the time to test and share!

Giulero avatar Oct 30 '25 06:10 Giulero

Hi!

Happy to contribute some stats!!

  • Are all computations running entirely on the GPU?

Yes, all computations are on GPU.

  • What precision are you using - float32 or float64?

My tensors are float32. However, I noticed that aba() -> _convert_to_arraylike() -> self.math.asarray() converts them to float64. I am not sure whether that is intended.
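For context on why this matters: PyTorch's type promotion upgrades any op mixing float32 and float64 to float64, so a single float64 array produced inside the library is enough to drag the whole downstream computation into double precision, which is dramatically slower on inference-oriented GPUs like the L4. A minimal demonstration:

```python
import torch

a = torch.ones(3, dtype=torch.float32)
b = torch.ones(3, dtype=torch.float64)  # e.g. created via numpy, whose default is float64

# mixing the two silently promotes the result to double precision
print((a + b).dtype)  # torch.float64
print((a + a).dtype)  # torch.float32
```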

  • How many steps per second do you see in Isaac Sim, and how many for adam’s aba (i.e., 1 / fun_exec_time × batch_size)?

Not sure exactly what you expect, but I think this serves the purpose: tqdm reports steps per second. Before adam joins training it is 8k steps/s; afterwards it drops to 1.8k steps/s with num_envs=4096. Also, here are aba() stats from the torch profiler:

aba:
   Mean:   318.377 ms
   Std:    16.720 ms
   Min:    289.022 ms
   Max:    417.350 ms
   Median: 316.080 ms
   95th:   349.878 ms


Thanks a lot for planning to optimize it! I am looking forward to the new update!!!

briannnyee avatar Oct 31 '25 14:10 briannnyee

Hi @briannnyee! Some optimizations have been applied to the algorithms by caching and precomputing some quantities, see #143.

Optimized

URDF: stickbot.urdf | NDoF: 34 | batch: 4096 | dtype: torch.float32 | device: cuda | rep: mixed
RNEA: 112.229 ms/iter
ABA:  149.751 ms/iter
CRBA: 94.922 ms/iter

Previous

URDF: stickbot.urdf | NDoF: 34 | batch: 4096 | dtype: torch.float32 | device: cuda | rep: mixed
RNEA: 145.100 ms/iter
ABA: 190.172 ms/iter
CRBA: 143.114 ms/iter

This means that, for example, for aba we now have 1 / 0.150 * 4096 ~ 27306 steps/s.
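Spelled out for all three routines (plain arithmetic on the benchmark numbers above, batch 4096):

```python
batch = 4096

# ms/iter from the tables above
previous  = {"RNEA": 145.100, "ABA": 190.172, "CRBA": 143.114}
optimized = {"RNEA": 112.229, "ABA": 149.751, "CRBA": 94.922}

for name in previous:
    before = batch / (previous[name] / 1e3)    # steps/s before the change
    after = batch / (optimized[name] / 1e3)    # steps/s after the change
    print(f"{name}: {before:,.0f} -> {after:,.0f} steps/s ({after / before:.2f}x)")
```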

Giulero avatar Nov 18 '25 09:11 Giulero