R2R-EnvDrop Questions about Enhanced Speaker


Open • ZhuFengdaaa opened this issue 5 years ago • 1 comment

You propose an enhanced version of the Speaker in Section 3.4.3. However, the geographic information (g) and the actions (a) are only used to compute the attention weights over the features in the attention mechanism.

I have difficulty understanding why g and a are not used directly to compute the context vector. Could you point me to related work that motivates this design?

Sep 10 '19 03:09 ZhuFengdaaa

Thanks for pointing it out.

I used a trick, a "fused hidden state", when implementing the attention layer here: https://github.com/airsplay/R2R-EnvDrop/blob/4c115853b6e53dd245f965e99d63579372d7ebdb/r2r_src/model.py#L122.

Mathematically, it "adds" the information in the query to the retrieved context vector:

c   = Att(query, {key})    # attention-weighted sum of the keys
out = FC([query, c])       # fully connected layer over the concatenation of query and c

Thus, the information in g and a is still passed on to, and can be captured by, the second LSTM.
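To make the difference concrete, here is a minimal pure-Python sketch contrasting plain attention with the "fused hidden state" variant above. It is not the repository's code: the function names (`softmax`, `attention`, `fused_attention`), the dot-product scoring, and the tanh after the FC layer are assumptions made for illustration, and `query` simply stands in for a vector derived from g and a. The two toy queries differ only along a direction orthogonal to every key, so plain attention returns the same context for both, while the fused output still changes.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention(query, keys):
    """c = Att(query, {key}): softmax(query . key_i)-weighted sum of the keys.

    The query only affects c through the weights, so c always lies in the
    convex hull of the keys.
    """
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    return [sum(w * k[j] for w, k in zip(weights, keys)) for j in range(dim)]


def fused_attention(query, keys, W, b):
    """out = FC([query, c]), with tanh assumed as the FC nonlinearity."""
    c = attention(query, keys)
    fused = query + c  # list concatenation: [query, c]
    return [math.tanh(dot(row, fused) + bias) for row, bias in zip(W, b)]


# Hypothetical toy example: both keys have a zero third component.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
q1 = [1.0, 0.0, 0.0]
q2 = [1.0, 0.0, 5.0]  # differs from q1 only along the third axis

# FC weights chosen so that out = tanh(query + c), purely for illustration.
W = [[1.0 if j in (i, i + 3) else 0.0 for j in range(6)] for i in range(3)]
b = [0.0, 0.0, 0.0]

print(attention(q1, keys) == attention(q2, keys))  # True: same context
print(fused_attention(q1, keys, W, b))
print(fused_attention(q2, keys, W, b))  # third entry is tanh(5) instead of 0
```

In this toy setup, the extra component of `q2` is invisible to plain attention but reaches the output through the concatenation, which is the sense in which g and a can still inform the next layer.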

I am sorry that I forgot to mention this in the paper.

Oct 23 '19 00:10 airsplay
