
Why does disc_reward_mean converge to near zero after training converges?

dbdxnuliba opened this issue 1 year ago • 6 comments

@ZhengyiLuo Hello, I trained g1 with the command:

python phc/run_hydra.py project_name=Robot_IM robot=unitree_g1 env=env_im_g1_phc env.motion_file=sample_data/0-DanceDB_20120807_CliodelaVara_Clio_Haniotikos_C3D_poses.pkl learning=im_pnn_big exp_name=unitree_g1_DanceDB_20120807_CliodelaVara_Clio_Haniotikos_C3D_poses sim=robot_sim control=robot_control learning.params.network.space.continuous.sigma_init.val=-1.7

After training, the wandb log curves look like the following: [wandb training curve screenshots]


My question is: why does disc_reward_mean converge to near zero after training converges, while the mean reward goes up? What does disc_reward_mean represent? Does a low value mean that the AMP style is not being learned well?

What does disc_reward_mean = 1 mean, and what does disc_reward_mean = 0 mean? Does a value closer to 0 indicate that the AMP style is more similar? Why does the rendered GIF in Isaac Gym look good after 24,000 iterations, even though disc_reward_mean has already approached 0? Should disc_reward_mean ideally be closer to 0 or closer to 1? The related code is in amp_agent.py:

[code screenshot from amp_agent.py] and the related code in amp_players.py: [code screenshot from amp_players.py]
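Since the code screenshots are not preserved, here is a minimal, self-contained sketch of how the discriminator reward is typically computed in AMP-style codebases such as ASE (which PHC builds on). The function and argument names and the reward scale are illustrative assumptions, not copied from the PHC source:

```python
import torch

def calc_disc_rewards(disc_logits: torch.Tensor, reward_scale: float = 2.0) -> torch.Tensor:
    """Illustrative AMP-style discriminator reward.

    disc_logits: raw discriminator outputs for the policy's transitions.
    prob: the discriminator's estimated probability that a transition came
    from the reference motion data.
    """
    with torch.no_grad():
        prob = torch.sigmoid(disc_logits)
        # -log(1 - prob) is large when the policy fools the discriminator
        # (prob -> 1) and approaches 0 when the discriminator confidently
        # labels policy transitions as fake (prob -> 0). disc_reward_mean
        # dropping towards 0 therefore means the discriminator is "winning".
        disc_r = -torch.log(torch.clamp(1.0 - prob, min=1e-4))
        disc_r = disc_r * reward_scale
    return disc_r
```

Assuming PHC follows the usual ASE/AMP convention, the quantity logged as disc_reward_mean is the batch mean of this reward.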

dbdxnuliba avatar Nov 28 '24 05:11 dbdxnuliba

That is the loss of the AMP discriminator; going to zero means it is doing well. See this paper: https://bit.ly/3hpvbD6

luoye2333 avatar Dec 03 '24 07:12 luoye2333

@luoye2333 @ZhengyiLuo But the function name in the code is _calc_disc_rewards. I thought that the larger the reward, the better the style learned by AMP, but it turns out it's a loss?

The related formulas from the AMP paper are below. Which part of the formula in the AMP paper does disc_reward_mean in PHC correspond to? [screenshots of the AMP paper equations]
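For reference, since the screenshots are not preserved: the style reward defined in the AMP paper (Peng et al., 2021) is

$$
r\left(s_t, s_{t+1}\right) = \max\left[0,\; 1 - \tfrac{1}{4}\bigl(D\left(s_t, s_{t+1}\right) - 1\bigr)^2\right],
$$

where D is the least-squares discriminator trained to output 1 on reference-motion transitions and -1 on policy transitions. disc_reward_mean corresponds to the batch mean of this per-transition style reward, not to the discriminator loss (released AMP implementations often use a cross-entropy variant, -log(1 - sigmoid(D)), instead of the least-squares form).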

dbdxnuliba avatar Dec 05 '24 03:12 dbdxnuliba

@dbdxnuliba disc_reward_mean is indeed the discriminator reward. It looks like the reward was high but then goes back to 0? Looks like the discriminator experienced mode collapse. The motion looks pretty good though, so that's good news? Does this always happen for other motions as well? Here is my disc_reward_mean for some of the H1 experiments I ran; it stays around 0.2–0.3.

[wandb curve of disc_reward_mean from the H1 experiments]

ZhengyiLuo avatar Dec 10 '24 01:12 ZhengyiLuo

Sorry, I mixed up disc_reward_mean with disc_loss in my previous comment.

I am currently training without getup (fall recovery), so I removed the AMP style reward (set disc_reward_w: 0. in the yaml). The results are good in terms of convergence speed and evaluation success rate compared with the runs that include the AMP reward. So I guess that if we don't need fall recovery, the style reward does not play an important part.

But for fall recovery in my runs, even though the AMP style reward is added, the fall-recovery behaviour still does not look like human behaviour. Perhaps that is because the AMP style reward is a very weak constraint. This is my training curve: [training curve screenshot, 2024-12-10]

luoye2333 avatar Dec 10 '24 02:12 luoye2333

@ZhengyiLuo Hi author! What do you mean by mode collapse? Should we increase the size of the discriminator network? Or is the problem with the gradient penalty of the discriminator?

luoye2333 avatar Dec 10 '24 02:12 luoye2333

@ZhengyiLuo Thank you for your reply. My question is, isn't AMP used for style tracking? Why does the reward for AMP tend to 0 during dancing, yet the tracked style after training still looks very similar?

dbdxnuliba avatar Dec 10 '24 15:12 dbdxnuliba