mujoco-py
MuJoCo Ant-v2 doesn't reset the env when the ant flips over
I am using the MuJoCo Ant-v2 environment with my DDPG model, but my average reward plateaus around 300-400. When I checked `env.render()` to see more detail, I noticed that even when the ant is flipped over, the `done` flag from `env.step(action)` is still False. The ant therefore keeps collecting the survive reward until the episode hits the max length of 1000 steps (5000 when rendering) and only then resets. In fact, whenever my reward is high (about 600-700), the render shows the ant flipped over and not moving forward; it looks like my model has learned that flipping over is the best way to get reward.

Is this a common situation? And can someone tell me when the Ant's `done` is set to True? The relevant code in the original Ant environment is roughly:
```python
state = self.state_vector()
notdone = np.isfinite(state).all() and state[2] >= 0.2 and state[2] <= 1.0
done = not notdone
```
But I can't figure out when this actually sets `done` to True. Thanks!
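Reading that snippet literally, `done` becomes True only when the state contains non-finite values or the torso height `state[2]` leaves the [0.2, 1.0] band. A minimal sketch of the same check (the state length of 29 here is just for illustration):

```python
import numpy as np

def ant_done(state):
    # Mirrors the Ant-v2 check: healthy iff all state values are finite
    # and the torso z-coordinate (state[2]) stays within [0.2, 1.0].
    notdone = np.isfinite(state).all() and 0.2 <= state[2] <= 1.0
    return not notdone

# A flipped-over ant whose torso still sits ~0.26 above the ground
# does NOT terminate the episode:
flipped = np.zeros(29)
flipped[2] = 0.26
print(ant_done(flipped))   # False

# Only falling below 0.2 (or jumping above 1.0, or a NaN) terminates:
fallen = np.zeros(29)
fallen[2] = 0.1
print(ant_done(fallen))    # True
```

So a flipped ant whose torso happens to rest above 0.2 never triggers the reset.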
Update:
I also checked the paper Benchmarking Deep Reinforcement Learning for Continuous Control, which describes the Ant environment's termination condition as requiring 0.2 <= z_body <= 1.0, where z_body is the z-coordinate of the body. This corresponds to the code in Ant-v2, where the episode continues while `state[2] >= 0.2` and `state[2] <= 1.0`. But in the Ant-v2 runs I watched, the episode also continues when the ant flips over. Is it possible for the ant to flip over without violating those conditions?
Thanks for the interesting question! When I printed the z-coordinate of the body while the Ant robot was flipped over, the values were 0.259, 0.261, 0.273, 0.302, etc. Hence, I wonder whether it would be reasonable to modify the code so that when the z-coordinate of the body consistently falls below about 0.3, the environment resets, since the robot is flipped over.
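For anyone who wants to experiment with that idea, here is a hedged sketch of the modified check, raising the lower healthy bound from 0.2 to 0.3. The 0.3 threshold is an assumption taken from the heights printed above, not an official value:

```python
import numpy as np

def done_with_raised_floor(state, z_min=0.3, z_max=1.0):
    # Same structure as the stock Ant-v2 check, but with the lower
    # z bound raised to z_min so a flipped-over torso counts as done.
    healthy = np.isfinite(state).all() and z_min <= state[2] <= z_max
    return not healthy

# Flip-over heights reported above (~0.26) now end the episode:
print(done_with_raised_floor(np.array([0.0, 0.0, 0.26])))  # True
print(done_with_raised_floor(np.array([0.0, 0.0, 0.55])))  # False
```

Raising the floor too far could also terminate legitimate low crouches, so the threshold likely needs tuning per task.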
@dkkim93 Thanks for your reply. I have the same observation too, and I was wondering is this just like a local optimal for this case ? Ant can flip over or stay still to get the survive reward(=1) in each step . So if the Ant flip over or stay still with small joint control (because the reward also has one term deal with the joint control penalty),the ant can get almost 900 reward (local optimal). However,the best reward is still move forward as fast as it can , I have tune the parameters for my ddpg and get the reward like 1200 for some episode and it forward quickly.
I've been noticing the same behavior. It's kind of unclear if this is intentional or not...
I had the same issue. So we need to adjust `healthy_z_range` for our desired task.
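In Gym's newer Ant-v3, this is exposed as a constructor kwarg, e.g. `gym.make("Ant-v3", healthy_z_range=(0.3, 1.0))`. Ant-v2 has no such kwarg, so one rough workaround is a wrapper that overrides `done`. A sketch, assuming the wrapped env follows gym's MujocoEnv interface (where `state_vector()[2]` is the torso height) and using an illustrative threshold of 0.3:

```python
class FlipOverTermination:
    """Ends the episode early when the torso height drops below z_min.

    Assumes the wrapped env exposes `unwrapped.state_vector()` as
    gym's MujocoEnv does; the 0.3 default is illustrative only.
    """

    def __init__(self, env, z_min=0.3):
        self.env = env
        self.z_min = z_min

    def __getattr__(self, name):
        # Delegate everything else (reset, render, ...) to the inner env.
        return getattr(self.env, name)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Force termination if the torso is too low (likely flipped over).
        if self.env.unwrapped.state_vector()[2] < self.z_min:
            done = True
        return obs, reward, done, info
```

Usage would be `env = FlipOverTermination(gym.make("Ant-v2"))`; the rest of the training loop stays unchanged.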
I hope this GIF helps! It provides the same intuition as the comment above.