
Is this reward function good for competition evaluation?

Open luckeciano opened this issue 4 years ago • 3 comments

Hey guys,

I would like to raise a concern regarding the reward function.

After some analysis, I think it can be easily exploited by controllers that do not walk. Basically, the positive reward comes from the alive bonus and from the footstep duration. An agent can just perform footsteps with no pelvis velocity (keeping its initial position), or even perform a single long footstep from the beginning of the episode until the end without changing its position. In this way, the penalization is very low (the effort is low and there is no deviation penalty because in the initial position v_tgt is a null vector).
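To make the concern concrete, here is a back-of-the-envelope sketch of the reward such a motionless controller could collect. All weights and magnitudes below are placeholder values I made up for illustration, not the actual numbers from the environment code:

```python
# Back-of-the-envelope sketch of the exploit: every constant below is an
# illustrative placeholder, not the actual value used by the environment.

ALIVE_BONUS_PER_STEP = 0.1       # assumed per-step alive bonus
FOOTSTEP_DURATION_WEIGHT = 10.0  # assumed weight on footstep duration
EFFORT_WEIGHT = 1.0              # assumed weight on the effort penalty
VEL_DEV_WEIGHT = 1.0             # assumed weight on the velocity-deviation penalty

def motionless_episode_reward(n_steps, dt=0.01, effort=0.02, vel_dev=0.1):
    """Net reward for an agent holding a single long 'footstep' in place."""
    alive = ALIVE_BONUS_PER_STEP * n_steps
    # one footstep spanning the whole episode gets its full duration as bonus
    footstep = FOOTSTEP_DURATION_WEIGHT * n_steps * dt
    # penalties stay small: barely any muscle activation, small deviation near the start
    penalty = (EFFORT_WEIGHT * effort + VEL_DEV_WEIGHT * vel_dev) * n_steps
    return alive + footstep - penalty

print(motionless_episode_reward(n_steps=1000))  # comfortably positive
```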

As the objective of the competition is to learn to effectively walk following the navigation field, I think the reward function should be modified. My first thought is to add another term that explicitly rewards movement (rough sketch below). What do you guys think?
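For concreteness, this is the kind of term I have in mind. `pelvis_vel`, `v_tgt`, and `w_move` are placeholder names for illustration, not actual variables from the environment:

```python
# Sketch of an extra reward term that pays only for actual progress of the
# pelvis along the target-velocity direction. Names are placeholders.
import numpy as np

def movement_bonus(pelvis_vel, v_tgt, w_move=1.0):
    """Positive only when the pelvis actually moves along the target direction."""
    v_tgt = np.asarray(v_tgt, dtype=float)
    direction = v_tgt / (np.linalg.norm(v_tgt) + 1e-8)
    return w_move * max(0.0, float(np.dot(pelvis_vel, direction)))
```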

luckeciano avatar Sep 10 '19 13:09 luckeciano

@luckeciano Could you elaborate on v_tgt being null at the initial position? How did you get this null vector?

smsong avatar Sep 11 '19 20:09 smsong

Hey @smsong,

Actually, I made a mistake. The v_tgt is not null at the initial position (I saw a point on the map, but there is an arrow as well). I'm sorry.

However, I printed the components of the footstep reward, and in this situation the penalization is very low compared with the total reward of just taking one long footstep during the episode. In one of my tests, my agent took a single footstep, obtaining a reward of 47 and losing only ~10 from effort and velocity deviation.
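For reference, this is roughly the kind of rollout I used (assuming the standard L2M2019Env from osim-rl and a 22-dimensional muscle-activation action; the exact setup may differ on your side):

```python
# Rough reproduction of the test: keep the model (almost) motionless and sum
# the reward. Assumes the standard osim-rl L2M2019Env; the action dimension
# and constructor arguments may need adjusting for your installation.
import numpy as np
from osim.env import L2M2019Env

env = L2M2019Env(visualize=False)
obs = env.reset()

action = np.zeros(22)  # near-zero muscle activations -> very low effort penalty
total_reward = 0.0
done = False
while not done:
    obs, reward, done, info = env.step(action)
    total_reward += reward

print("episode reward with an (almost) motionless policy:", total_reward)
```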

Therefore, it is possible to obtain almost all of the available reward without leaving the initial position. I think the reward should be modified, at least the weights. Otherwise, there is a possibility of top submissions without any walking motion.

luckeciano avatar Sep 12 '19 01:09 luckeciano

@luckeciano Thanks for the clarification and suggestion. However, if a network exploits the single-footstep solution you've mentioned, it would probably get stuck at a local minimum and not be able to compete with good solutions. And it is possible that some participants have already gotten around this issue by using different rewards to first train a good network and then fine-tune for the given reward. So it may be unfair to change the reward at this point. A systematic investigation of rewards that facilitate training could be an interesting study ;)

smsong avatar Sep 12 '19 03:09 smsong