deepvoice3

Some questions

Open DabiaoMa opened this issue 8 years ago • 2 comments

Hi,

I have read through your implementation of Deep Voice 3, and it is really clean. Have you gotten any good results yet?

I also have a few questions that maybe you could help me clear up.

  1. 'modules.py', line 24. Why do we need to set the first row of the embedding matrix to a zero vector?

  2. 'modules.py', line 270. I checked the paper, but I did not find the details about the 'scale' option...

  3. 'modules.py', lines 338 and 343. The paper says, 'For a single speaker, ωs is set to one for the decoder and fixed for the encoder to the ratio of output timesteps to input timesteps'. So maybe for the queries, position_rate should be 1, and for the keys, position_rate should be hp.T_y/hp.T_x?

  4. 'modules.py', line 384. I think this line is performing context normalization, so maybe the denominator should be the square root of the total number of input time steps, something like sqrt(tf.to_float(val.get_shape()[1]))?

  5. 'synthesis.py', line 38. Maybe the total number of time steps should be hp.T_y//hp.r?
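
For reference, the scaling proposed in question 4 can be sketched in plain NumPy. This is only an illustration of the proposal, not the code in 'modules.py': the function name `attend_with_context_scaling` and the plain dot-product scoring are assumptions made for the example, and whether dividing by the square root of the input length matches the paper exactly is not confirmed here.

```python
import numpy as np


def attend_with_context_scaling(queries, keys, values):
    """Dot-product attention whose context is divided by sqrt(T_x).

    queries: (T_q, d), keys: (T_x, d), values: (T_x, d_v).
    Hypothetical helper; it only illustrates the denominator
    suggested in question 4.
    """
    scores = queries @ keys.T  # (T_q, T_x)
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    context = weights @ values  # (T_q, d_v)

    # Proposed normalization: the square root of the number of
    # input time steps, read from the static shape of `values`.
    t_x = values.shape[0]
    return context / np.sqrt(float(t_x)), weights
```

The input length is taken from `values.shape[0]`, which plays the same role as `val.get_shape()[1]` in the batched TensorFlow expression from the question.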

Thanks

DabiaoMa, Oct 30 '17 04:10

Thanks.

I haven't had any success yet.

  1. Index 0 is reserved for padding, so I wanted its embedding to be all zeros. But I guess it would make no difference if it had nonzero values.
  2. I followed 'Attention Is All You Need': https://arxiv.org/pdf/1706.03762.pdf
  3. You're right. Technically, the position rate for the encoder should be (T_y//r)/T_x since T_y is reduced by the reduction factor r.
  4. I think you're right.
  5. Yup, that's already done.
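
Answers 1 and 3 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the repository's code: the helper names (`position_rates`, `positional_encoding`, `embedding_table`) are hypothetical, the sinusoid layout follows the common sin/cos interleaving rather than any particular line of 'modules.py', and `T_x`, `T_y`, `r` stand in for the hyperparameters `hp.T_x`, `hp.T_y`, `hp.r`.

```python
import numpy as np


def position_rates(T_x, T_y, r):
    """Return (decoder_rate, encoder_rate) for a single speaker.

    The decoder (queries) uses rate 1; the encoder (keys) uses the
    ratio of reduced output steps to input steps, (T_y // r) / T_x.
    """
    decoder_steps = T_y // r
    return 1.0, decoder_steps / T_x


def positional_encoding(num_steps, num_units, position_rate=1.0):
    """Sinusoidal encoding with positions scaled by position_rate."""
    positions = np.arange(num_steps, dtype=np.float64)[:, None] * position_rate
    channels = np.arange(num_units)[None, :]
    angles = positions / np.power(10000.0, 2 * (channels // 2) / num_units)

    enc = np.zeros((num_steps, num_units))
    enc[:, 0::2] = np.sin(angles[:, 0::2])  # even channels
    enc[:, 1::2] = np.cos(angles[:, 1::2])  # odd channels
    return enc


def embedding_table(vocab_size, num_units, seed=0):
    """Random embedding table whose row 0 (padding id) is all zeros."""
    rng = np.random.default_rng(seed)
    table = rng.normal(scale=0.1, size=(vocab_size, num_units))
    table[0] = 0.0  # id 0 is reserved for padding
    return table


# Example values are made up for illustration only.
decoder_rate, encoder_rate = position_rates(T_x=100, T_y=800, r=4)
query_positions = positional_encoding(800 // 4, 16, decoder_rate)
key_positions = positional_encoding(100, 16, encoder_rate)
```

With these rates, the last key position and the last query position map to roughly the same angle, which is the alignment the quoted sentence from the paper is after.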

Kyubyong, Oct 30 '17 08:10

I would like to implement it in MXNet, but I am still hesitant. I hope you get good results.

DabiaoMa, Nov 01 '17 04:11