[Question] adam v0.4.0 PyTorch interface M, C, G calculation super slow?
Hello devs,
I tried out adam v0.4.0 and found that computing M, C, and G with the PyTorch interface is super slow, regardless of batch size.
```python
self.kinDyn = KinDynComputationsBatch(self.asset_path, self.actuated_joint_names)

self.M = self.kinDyn.mass_matrix(w_H_b, joint_pos)  # (batch_size, 6 + n_joints, 6 + n_joints)
# Mass matrix computation time: 0.21195363998413086 s

self.C = self.kinDyn.coriolis_term(w_H_b, joint_pos, base_vel, joint_vel)  # (batch_size, 6 + n_joints)
# Coriolis term computation time: 0.3075296878814697 s

self.G = self.kinDyn.gravity_term(w_H_b, joint_pos)  # (batch_size, 6 + n_joints)
# Gravity term computation time: 0.2587003707885742 s
```
All computations were on an L4 GPU. Is this speed normal, or am I missing something?
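For anyone reproducing this, a minimal timing sketch (the helper name is made up; without torch.cuda.synchronize() the numbers would only capture the asynchronous kernel launches, not the actual compute):

```python
import time

import torch

def time_call(fn, *args, n_iters=10):
    # Warm-up runs exclude one-off allocation/compilation costs.
    for _ in range(2):
        fn(*args)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        fn(*args)
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    return (time.perf_counter() - start) / n_iters

# e.g. mass_matrix_time = time_call(self.kinDyn.mass_matrix, w_H_b, joint_pos)
```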
Thanks in advance!
Hi @briannnyee and thanks for the feedback! :)
What batch size are you using? We should also double-check that all the computations are done on the same device.
You could save some time by directly calling the function that computes the bias force, i.e. the sum of the Coriolis and gravity terms, so that this computation is done only once.
Also, I don't know if you are computing the forward dynamics, but the aba function might be slightly faster.
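Something along these lines (a sketch; I'm assuming the batched method is named bias_force and mirrors the coriolis_term signature):

```python
# h = C(q, v) v + g(q): one call instead of separate Coriolis and gravity terms
h = self.kinDyn.bias_force(w_H_b, joint_pos, base_vel, joint_vel)  # (batch_size, 6 + n_joints)
```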
I would say that adam is not really optimized to be super fast, but with some tricks we might make it faster!
Hello!
I tested batch_size=16 and 4096; the computation time is the same for both.
I am aware of both methods you mentioned, and yes, it makes sense to call them directly. Still, aba takes roughly the same amount of time (0.2xx sec).
I understand that adam aims to be differentiable rather than super fast. Still, I'd argue that the long compute time keeps adam from being practically usable, since it dramatically increases training time, which works against the library's own goals.
As the field increasingly explores combining model-based control with deep learning, it would help adam gain popularity if it could be both fast and differentiable.
Let me know if I am being too greedy here ;P. Do you have plans in the near future to improve speed?
Hi! You’re not being greedy - just honest, and that’s good! ;) I'm aware that faster execution would make the library much more practical, and we do plan to work on performance improvements.
In the meantime, could you help me with a few quick checks to guide optimization?
- Are all computations running entirely on the GPU?
- What precision are you using - float32 or float64?
- How many steps per second do you see in Isaac Sim, and how many for adam's aba (i.e., 1 / fun_exec_time × batch_size)?
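For the first two checks, something like this quick sketch (reusing the tensor names from your snippet) would already tell us a lot:

```python
for name, t in {"w_H_b": w_H_b, "joint_pos": joint_pos,
                "base_vel": base_vel, "joint_vel": joint_vel}.items():
    print(f"{name}: device={t.device}, dtype={t.dtype}")
# Everything should report the same CUDA device and a consistent dtype;
# a stray CPU tensor or a float64 input causes hidden transfers or upcasts.
```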
Sadly I won't be able to work on this actively in the next few days, but your checks would be super helpful for profiling and speeding things up. Thanks a lot for taking the time to test and share!
Hi!
Happy to contribute some stats!!
- Are all computations running entirely on the GPU?
Yes, all computations are on GPU.
- What precision are you using - float32 or float64?
My tensors are float32. However, I noticed that aba() -> _convert_to_arraylike() -> self.math.asarray() converts them to float64; I am not sure whether that is intended (the sketch after the profiler stats below shows why this matters).
- How many steps per second do you see in Isaac Sim, and how many for adam’s aba (i.e., 1 / fun_exec_time × batch_size)?
Not sure exactly what you expect, but I think this should serve the purpose: tqdm shows steps per second. Before adam joins the training loop it is 8k steps/s; afterwards it is 1.8k steps/s with num_envs=4096. Also, here are the aba() stats from the torch profiler:
```
aba:
  Mean:   318.377 ms
  Std:     16.720 ms
  Min:    289.022 ms
  Max:    417.350 ms
  Median: 316.080 ms
  95th:   349.878 ms
```
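To illustrate why that float64 conversion matters (not adam's exact code path, just the general PyTorch promotion pitfall): any internal constant created in float64 silently promotes the whole downstream computation.

```python
import torch

x = torch.randn(4096, 34, dtype=torch.float32)    # float32 inputs, as in my setup
g = torch.full((34,), 9.81, dtype=torch.float64)  # a constant created in float64
print((x * g).dtype)              # torch.float64: the float32 inputs got promoted
print((x * g.to(x.dtype)).dtype)  # torch.float32: pinning the dtype avoids it
```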
Thanks a lot for planning to optimize it! I am looking forward to the new update!!!
Hi @briannnyee! Some optimization has been done on the algorithms by caching and precomputing some quantities; see #143.
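The general idea (a toy sketch, not the actual #143 change): anything that depends only on the model, and not on the configuration, can be computed once at construction and reused on every call.

```python
import torch

class CachedDynamics:
    # Toy illustration: split construction-time work from per-call work.
    def __init__(self, n_joints: int):
        # Stand-in for configuration-independent quantities (in adam these
        # would be, e.g., link inertias and joint axes parsed from the URDF).
        self.const = torch.eye(6 + n_joints)

    def mass_matrix(self, q: torch.Tensor) -> torch.Tensor:
        # Per-call work touches only the q-dependent part; the cached
        # constant is reused instead of being rebuilt on every call.
        return self.const + q.unsqueeze(-1) * q.unsqueeze(-2)

dyn = CachedDynamics(n_joints=34)
M = dyn.mass_matrix(torch.randn(6 + 34))
```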
Setup: URDF: stickbot.urdf | NDoF: 34 | batch: 4096 | dtype: torch.float32 | device: cuda | rep: mixed

| Algorithm | Previous (ms/iter) | Optimized (ms/iter) |
|-----------|-------------------:|--------------------:|
| RNEA      | 145.100            | 112.229             |
| ABA       | 190.172            | 149.751             |
| CRBA      | 143.114            | 94.922              |
This means that, for example, for aba we now have 1 / 0.150 * 4096 ~ 27306 steps/s.
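In code, the ms/iter to steps/s conversion is simply:

```python
def steps_per_second(ms_per_iter: float, batch_size: int) -> float:
    return batch_size / (ms_per_iter / 1e3)

print(steps_per_second(149.751, 4096))  # ~27352 (the ~27306 above rounds to 0.150 s)
print(steps_per_second(190.172, 4096))  # ~21538 before the optimization
```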