R2R-EnvDrop
Questions about Enhanced Speaker
You claim an enhanced version of the Speaker in Section 3.4.3. However, the geographic information and actions are only used to calculate the weights of the features in the attention mechanism.
I have difficulty understanding why g, a are not used directly to calculate the context. Could you point to some related work that motivates this design?
Thanks for pointing it out.
I used a "fused hidden state" trick when implementing the attention layer here: https://github.com/airsplay/R2R-EnvDrop/blob/4c115853b6e53dd245f965e99d63579372d7ebdb/r2r_src/model.py#L122.
Mathematically, it "adds" the information of the query to the retrieved context vector:
c = Att(query, {key})
out = FC([query, c])
Thus, the information of g, a would be captured by the second LSTM.
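As a rough illustration of the two equations above (a sketch, not the repository code itself), the fused step can be written in plain Python. The names `fused_attention`, `W_in`, and `W_out`, the linear projection of the query before scoring, and the `tanh` on the output are assumptions made for this sketch; the linked model.py is the reference for what is actually implemented.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def fused_attention(query, keys, W_in, W_out):
    """Hypothetical sketch of: c = Att(query, {key}); out = FC([query, c]).

    query: list of d_q floats (in the Speaker it would carry g, a)
    keys:  list of feature vectors, each of d_k floats
    W_in:  d_k x d_q projection applied to the query (assumed)
    W_out: d_out x (d_k + d_q) weights of the FC layer (assumed)
    """
    # Attention weights: the query only decides how much each key counts.
    target = matvec(W_in, query)
    weights = softmax([dot(k, target) for k in keys])
    # Context c: a weighted sum of the keys, with no query terms in it.
    context = [sum(w * k[i] for w, k in zip(weights, keys))
               for i in range(len(keys[0]))]
    # Fusion: concatenating the query puts its information into the output.
    fused = [math.tanh(v) for v in matvec(W_out, context + query)]
    return fused, weights, context


# Usage: identical keys give the same context for any query,
# yet the fused output still depends on the query.
keys = [[1.0, 2.0], [1.0, 2.0]]
W_in = [[1.0, 0.0], [0.0, 1.0]]
W_out = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
out_a, weights_a, context_a = fused_attention([0.5, -0.5], keys, W_in, W_out)
out_b, weights_b, context_b = fused_attention([-0.5, 0.5], keys, W_in, W_out)
```

In this toy case `context_a` and `context_b` are the same, while `out_a` and `out_b` differ, which is the sense in which the query reaches the layer that consumes `out`.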
I am sorry that I forgot to mention it in the paper.