
Reward clipping and value transformation

zhixuan-lin opened this issue 4 years ago · 1 comment

Hello,

Thanks for this great work! I noticed that you choose to clip the reward to [-1, 1] for Atari. I'm wondering what's the purpose of applying value transformation (i.e. scalar_transform) if you already have the reward clipped?

zhixuan-lin, Mar 05 '22

Reward clipping makes the reward function easier to learn when data is limited; that is the reason for clipping rewards.
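The clipping itself is straightforward; a minimal sketch (the function name `clip_reward` is my own, not necessarily what the repo uses):

```python
def clip_reward(raw_reward: float) -> float:
    # Atari-style reward clipping: squash the raw game score into [-1, 1]
    # so reward magnitudes are comparable across games.
    return max(-1.0, min(1.0, raw_reward))
```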

As for why we apply the value transformation: we use a cross-entropy loss for reward prediction, learning a distribution over a discrete support rather than regressing a scalar with MSE. Moreover, since we predict the value prefix (the sum of rewards over the unrolled steps) in place of single-step rewards, the output is not in the range [-1, 1] but [-5, 5].
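The two pieces can be sketched as follows: the invertible transform h(x) = sign(x)(sqrt(|x| + 1) - 1) + eps·x (the form used by MuZero-style agents, which `scalar_transform` in the repo refers to) and a two-hot projection of the transformed scalar onto an integer support, which gives the categorical target for the cross-entropy loss. This is a minimal sketch under stated assumptions: the support range [-300, 300] is illustrative, not necessarily EfficientZero's exact setting.

```python
import math

def scalar_transform(x: float, eps: float = 0.001) -> float:
    # h(x) = sign(x) * (sqrt(|x| + 1) - 1) + eps * x
    sign = 1.0 if x >= 0 else -1.0
    return sign * (math.sqrt(abs(x) + 1.0) - 1.0) + eps * x

def inverse_scalar_transform(y: float, eps: float = 0.001) -> float:
    # Closed-form inverse of h, recovering the scalar from the network output.
    sign = 1.0 if y >= 0 else -1.0
    s = (math.sqrt(1.0 + 4.0 * eps * (abs(y) + 1.0 + eps)) - 1.0) / (2.0 * eps)
    return sign * (s * s - 1.0)

def scalar_to_support(x: float, support_min: int = -300, support_max: int = 300) -> list:
    # Two-hot projection: spread the (transformed) scalar over the two nearest
    # integer bins, producing a valid probability target for cross-entropy.
    x = max(float(support_min), min(float(support_max), x))
    low = math.floor(x)
    if low == support_max:  # keep the upper bin inside the support
        low -= 1
    frac = x - low
    target = [0.0] * (support_max - support_min + 1)
    target[low - support_min] = 1.0 - frac
    target[low + 1 - support_min] = frac
    return target

# The value prefix is the sum of clipped rewards over the unrolled steps,
# e.g. 5 unroll steps with rewards in [-1, 1] gives a prefix in [-5, 5].
value_prefix = sum([1.0, 1.0, 1.0, 1.0, 1.0])  # 5.0, outside [-1, 1]
```

Because the prefix can reach ±5 (and values can be far larger), the transform compresses the target range before it is projected onto the support.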

Hope this can help you:)

YeWR, Apr 29 '22