Causes of different behaviour in MJX and CPU MuJoCo
I've been taking a look at MJX, and I'm impressed with how smooth the usage is. As I've been playing around more with the tutorial code on my system, I noticed that, for example, a humanoid running policy that performs great in MJX usually ends up failing eventually when transferred to CPU MuJoCo.
https://github.com/google-deepmind/mujoco/assets/41113387/a078013a-3f1c-445a-9bc8-231b91509286
(video from the "MJX Policy in MuJoCo" cell of the tutorial, with rng = jax.random.PRNGKey(2))
This is of course not too unexpected, considering that I didn't use any domain randomization or other methods that would help with sim2sim transfer. I'm aware of this discussion about FP precision differences between MJX and regular MuJoCo: #1203. Beyond FP precision, are there other key differences between the two versions of the engine that can cause policies to fail to transfer (provided only features that are officially supported in both versions are used)? Are there certain settings that can be used with CPU MuJoCo to make it behave closer to MJX (or vice versa)?
Hi @Balint-H - glad you're finding MJX easy to use. While it's possible you've found some discrepancy between MuJoCo and MJX, it's also possible you picked a bad seed for the rollout. The RL policies in that colab aren't optimized for stability across seeds - you'll see that in the standard deviation bars in the reward evaluation graphs.
I think for us to determine there's an issue here, you might want to try doing, say, 512 policy rollouts on MJX and the same number on MuJoCo, and then showing the reward distributions are actually different. If you suspect there's an issue, please try that and let us know.
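If it helps, here's a minimal sketch of how that comparison could look once the per-rollout returns are collected; the file names and the Mann-Whitney U test are just placeholders, not something from the tutorial:

```python
# Compare per-rollout return distributions from the two engines.
# Assumes the returns were saved to plain text files, one value per rollout.
import numpy as np
from scipy import stats

mjx_returns = np.loadtxt("mjx_returns.txt")   # e.g. 512 MJX rollout returns
cpu_returns = np.loadtxt("cpu_returns.txt")   # e.g. 512 CPU MuJoCo rollout returns

print(f"MJX: mean={mjx_returns.mean():.1f} std={mjx_returns.std():.1f}")
print(f"CPU: mean={cpu_returns.mean():.1f} std={cpu_returns.std():.1f}")

# Nonparametric check on whether the two return distributions differ.
stat, p_value = stats.mannwhitneyu(mjx_returns, cpu_returns, alternative="two-sided")
print(f"Mann-Whitney U p-value: {p_value:.4g}")
```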
To answer your question, here are the calculation differences between MJX and MuJoCo that come to mind:
- The float precision, as you mention (see the sketch at the end of this reply for a quick way to rule this out)
- MJX's convex<>convex collision algorithms are different from MuJoCo's (shouldn't matter in your demo)
- MJX uses a slightly tweaked linesearch op inside its solver compared to MuJoCo, for performance reasons
That's it! Barring any bugs, of course, which we are happy to investigate should you find a repro. Cheers.
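If you want to rule out the first point, one option is to flip JAX into double precision and re-run the MJX rollout; MJX should then compute in float64 like the CPU engine (much slower, so only useful for debugging). A minimal sketch, with a placeholder model path:

```python
# Rough sketch: run an MJX step in double precision to see how much of the
# drift is attributable to float32. The XML path is a placeholder.
import jax
jax.config.update("jax_enable_x64", True)  # must be set before any JAX arrays are created

import mujoco
from mujoco import mjx

mj_model = mujoco.MjModel.from_xml_path("humanoid.xml")  # placeholder path
mj_data = mujoco.MjData(mj_model)

mjx_model = mjx.put_model(mj_model)
mjx_data = mjx.put_data(mj_model, mj_data)

jit_step = jax.jit(mjx.step)
mjx_data = jit_step(mjx_model, mjx_data)
print(mjx_data.qpos.dtype)  # float64 when x64 is active
```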
Hello, running with 20 different seeds, the episode lengths of the trained humanoid running policy are below (terminating when falling over, or reaching 500 decision steps):
MJX: [500, 500, 500, 500, 500, 77, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500] Average reward: 4567.945
MJX transferred to MuJoCo CPU: [363, 170, 173, 212, 404, 500, 204, 286, 321, 500, 391, 404, 252, 466, 427, 500, 249, 149, 279, 500] Average reward: 1475.5411
@erikfrey Is this much discrepancy expected? If not, is there perhaps a bug in the tutorial code (e.g. some postprocessing not being applied to the actions for the CPU version)?
Here's a modified version of the colab tutorial that repeats the evals for RNG seeds 0-19 (used for the results above): https://github.com/Balint-H/mujoco/blob/colab/mjx/mjx/tutorial.ipynb
Please let me know if I messed up the process of editing the tutorial somewhere.
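For reference, the MJX half of the evaluation in that notebook boils down to roughly the loop below; jit_reset, jit_step and jit_inference_fn are the jitted env and policy functions from the tutorial, and the CPU MuJoCo half is analogous but steps the policy through mujoco.mj_step instead.

```python
# Per-seed evaluation loop (MJX side), roughly as in the modified notebook.
# jit_reset, jit_step and jit_inference_fn come from the tutorial;
# 500 is the decision-step cap used for the numbers above.
import jax

episode_lengths, returns = [], []
for seed in range(20):
    rng = jax.random.PRNGKey(seed)
    state = jit_reset(rng)
    total_reward, steps = 0.0, 0
    for _ in range(500):
        act_rng, rng = jax.random.split(rng)
        ctrl, _ = jit_inference_fn(state.obs, act_rng)
        state = jit_step(state, ctrl)
        total_reward += float(state.reward)
        steps += 1
        if state.done:
            break
    episode_lengths.append(steps)
    returns.append(total_reward)

print(episode_lengths)
print("Average reward:", sum(returns) / len(returns))
```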
This is very helpful - thank you! I have a hunch as to what's going on here. I'll take a look.
@erikfrey any updates?
This sounds like a thread worth pulling on...
@Balint-H has there been any solution to this? It seems fairly critical. I'm also about to transfer my policy to MuJoCo CPU, so I'm wondering whether this was ever resolved.
At the moment I think the guideline is to train more robust agents, then fine-tune on CPU if needed. Although it has been a while since I tried the transfer, so some of the tweaks to MJX since then might have changed this.
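For anyone landing here later, one way to get a more robust agent is to randomize physics parameters across the batched training environments. Below is a rough sketch following the randomization_fn pattern from the Brax/MJX training tutorials (the friction range is arbitrary and just illustrative); if I remember correctly it can be passed to the PPO train function via its randomization_fn argument.

```python
# Sketch of a domain randomization function in the (sys, rng) -> (sys, in_axes)
# form accepted by Brax's training entry points. Scales every geom's friction
# by a per-environment random factor.
import jax

def domain_randomize(sys, rng):
    @jax.vmap
    def rand(rng):
        scale = jax.random.uniform(rng, minval=0.8, maxval=1.2)
        return sys.geom_friction * scale

    friction = rand(rng)  # rng is a batch of keys, one per environment

    in_axes = jax.tree_util.tree_map(lambda x: None, sys)
    in_axes = in_axes.tree_replace({'geom_friction': 0})
    sys = sys.tree_replace({'geom_friction': friction})
    return sys, in_axes
```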
I see. I also noticed this decline between MJX and MuJoCo in my application; I currently suspect it's due to something in my code rather than MJX, but then I came across this post and started wondering about it.