rlpd
The effect of the LayerNorm?
I have a question about LayerNorm. In the paper, you mention that if LayerNorm is applied in the network, the Q-values are bounded by the norm of the weight layer. Even with the formula explained, I am still puzzled about why the inequality holds for the last and second-to-last terms. For the inequality to hold, I think the norm of the LayerNorm output must be kept below 1, but this cannot be guaranteed. Could you please elaborate on this conclusion?