PHC
Why does disc_reward_mean converge to near zero after convergence?
@ZhengyiLuo Hello, when training the G1 with the command:
python phc/run_hydra.py project_name=Robot_IM robot=unitree_g1 env=env_im_g1_phc env.motion_file=sample_data/0-DanceDB_20120807_CliodelaVara_Clio_Haniotikos_C3D_poses.pkl learning=im_pnn_big exp_name=unitree_g1_DanceDB_20120807_CliodelaVara_Clio_Haniotikos_C3D_poses sim=robot_sim control=robot_control learning.params.network.space.continuous.sigma_init.val=-1.7
we get a wandb training curve like the following.
My question is: why does disc_reward_mean converge to near zero after convergence while the mean reward stays high? What does disc_reward_mean actually measure? Does a low value mean the AMP style is not being learned well?
What does disc_reward_mean = 1 mean, and what does disc_reward_mean = 0 mean? Does a value closer to 0 indicate that the learned motion style is closer to the reference? Why does the GIF of the rollout in Isaac Gym look good after 24,000 iterations even though disc_reward_mean has already approached 0? In short, is disc_reward_mean better the closer it is to 0, or the closer it is to 1? Here is the related code in amp_agent.py:
and the related code in amp_players.py:
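(The screenshots of those functions did not come through. As a point of reference, here is a minimal sketch of what the discriminator-reward computation typically looks like in the rl_games/IsaacGymEnvs-style AMP code that PHC builds on; the exact names and reward scale in PHC's amp_agent.py may differ.)

```python
import torch

def calc_disc_rewards(disc_logits: torch.Tensor, disc_reward_scale: float = 2.0) -> torch.Tensor:
    """Turn raw discriminator logits on policy transitions into a style reward."""
    with torch.no_grad():
        prob = torch.sigmoid(disc_logits)
        # The reward is large when the policy fools the discriminator (prob -> 1)
        # and close to 0 when the discriminator confidently rejects the sample.
        disc_r = -torch.log(torch.clamp(1.0 - prob, min=1e-4))
        return disc_r * disc_reward_scale
```

Under this formulation, disc_reward_mean near 0 means the discriminator can easily tell the policy's transitions apart from the reference motion, while a larger value means the policy is fooling it; so "closer to 0" is not the good direction for this metric.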
That is the loss of the AMP discriminator; it going to zero is a good sign. See this paper: https://bit.ly/3hpvbD6
@luoye2333 @ZhengyiLuo But the function name in the code is _calc_dis_rewards. I thought the larger the reward, the better the style AMP has learned, but it turns out it is a loss?
And the related formula from the AMP paper is the following:
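(The attached screenshot did not come through; presumably it shows the style reward defined in the AMP paper, which, if I recall correctly, uses the least-squares GAN form:)

$$ r^{\text{style}}(s_t, s_{t+1}) \;=\; \max\!\Big[\,0,\; 1 - \tfrac{1}{4}\big(D(s_t, s_{t+1}) - 1\big)^2 \Big] $$

The rl_games-based AMP implementations that PHC inherits appear to use a sigmoid discriminator with a -log(1 - D) style reward instead, so disc_reward_mean is the mean of that quantity rather than the paper's exact expression, as far as I can tell.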
Which formula in the AMP paper does disc_reward_mean in PHC correspond to?
@dbdxnuliba
disc_reward_mean is indeed the discriminator reward. It looks like the reward was high at first but then went back to 0? It looks like the discriminator experienced mode collapse. The motion looks pretty good though, so that's good news? Does this always happen for other motions as well? Here is my disc_reward_mean for some of the H1 experiments I ran; it stays around 0.2–0.3.
Sorry, I mixed up disc_reward_mean with disc_loss in my previous comment.
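To make the distinction concrete, here is a minimal sketch of the discriminator loss (as opposed to the style reward sketched above), assuming the usual BCE formulation; the names are illustrative rather than PHC's exact code:

```python
import torch
import torch.nn.functional as F

def calc_disc_loss(agent_logits: torch.Tensor, demo_logits: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy the discriminator is trained with.

    Policy transitions are labelled 0 and reference-motion transitions 1.
    This is what a disc_loss metric typically refers to, while
    disc_reward_mean is the mean style reward computed from the
    policy-side logits.
    """
    agent_loss = F.binary_cross_entropy_with_logits(agent_logits, torch.zeros_like(agent_logits))
    demo_loss = F.binary_cross_entropy_with_logits(demo_logits, torch.ones_like(demo_logits))
    return 0.5 * (agent_loss + demo_loss)
```

The usual implementations also add regularisers such as a gradient penalty on top of this loss, which is a separate term from the reward.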
I am currently training without get-up (fall recovery), so I removed the AMP style reward (set disc_reward_w: 0. in the yaml; see the sketch at the end of this comment for how that weight enters the total reward). The results are good in both convergence speed and evaluation success rate compared with the runs that use the AMP reward. So I guess that if we don't need fall recovery, the style reward does not play an important part.
But for fall recovery in my runs, even though the AMP style reward is added, the recovery behaviour still does not look like human behaviour. Perhaps this is because the AMP style reward is a very weak constraint. This is my training curve.
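(For context on the disc_reward_w setting above: the style reward usually enters the total reward as a simple weighted sum, roughly like the sketch below; the names are illustrative rather than PHC's exact code. With the weight at 0 the AMP term simply drops out.)

```python
import torch

def combine_rewards(task_rewards: torch.Tensor, disc_rewards: torch.Tensor,
                    task_reward_w: float = 0.5, disc_reward_w: float = 0.5) -> torch.Tensor:
    # Weighted sum of the imitation/task reward and the AMP style reward.
    # Setting disc_reward_w to 0.0 removes the discriminator's influence on training.
    return task_reward_w * task_rewards + disc_reward_w * disc_rewards
```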
@ZhengyiLuo Hi author! What do you mean by mode collapse? Should we increase the size of the discriminator network? Or is the problem with the discriminator's gradient penalty?
@ZhengyiLuo Thank you for your reply. My question is: isn't AMP used for style tracking? Why does the AMP reward tend to 0 during the dance, yet the tracked style after training still looks very similar?