DHP
On 'v' action rewards
Hello, dear Mr. Yuhang Song,
In the paper, it is mentioned that the rewards for action v are given by
And the parameters θv are optimized according to the rule:
In the code, however, no reward for v seems to be calculated (https://github.com/YuhangSong/DHP/blob/73ddec2b837f0379cc5d0e008cd9dc422d832c3b/envs.py#L488-L502). Instead, v_lable is estimated as a "weighted" target value, i.e. the sum of subject_i_v * similarity (https://github.com/YuhangSong/DHP/blob/73ddec2b837f0379cc5d0e008cd9dc422d832c3b/suppor_lib.py#L154-L159), which then contributes another term, (v - v_lable)^2, to the loss function:
https://github.com/YuhangSong/DHP/blob/73ddec2b837f0379cc5d0e008cd9dc422d832c3b/a3c.py#L238-L239
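
For concreteness, here is a minimal numpy sketch of how I currently read these two pieces of code; the function and argument names are mine, not the ones in the repo, and the unnormalized weighted sum is just my reading of suppor_lib.py:

```python
import numpy as np

def estimate_v_lable(subject_vs, similarities):
    # Similarity-weighted target for v, as I read suppor_lib.py#L154-L159:
    # each subject's ground-truth v is weighted by its similarity score.
    # (Whether the sum is normalized by the total similarity is a detail
    # I am not sure about; the unnormalized sum matches my reading above.)
    subject_vs = np.asarray(subject_vs, dtype=np.float32)
    similarities = np.asarray(similarities, dtype=np.float32)
    return float(np.sum(subject_vs * similarities))

def v_loss_term(v_pred, v_lable):
    # Extra loss term as I read a3c.py#L238-L239: squared error between
    # the predicted v and the weighted target, rather than a reward signal.
    return (v_pred - v_lable) ** 2

# Toy example: three subjects' v values and their similarity weights.
v_lable = estimate_v_lable([0.2, 0.5, 0.9], [0.1, 0.3, 0.6])
print(v_lable, v_loss_term(0.4, v_lable))
```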
Is there a particular reason why the direct sum of rewards is not calculated and the above approach is used instead?
Bump!