
Issue in implementation

Open · litcoderr opened this issue on Feb 14, 2020 · 4 comments

Hi. I was interested in multi-modal video generation tasks and came across your paper.

My issues are:

  1. I was confused by the mismatch between your code and the model description in the paper. The paper has a 'Story Encoder', which appears to be a VAE module, but the code never uses it (link to code snippet); see the sketch after this list:
m_code, m_mu, m_logvar = motion_input, motion_input, motion_input #self.ca_net(motion_input)
  2. I am having a hard time interpreting the code. For example, what are 'motion features' and 'content features'? They are specified neither in the paper nor in the code. If you have time, please add some documentation.
  3. Where is this implemented? I might be misunderstanding; if someone understands the implementation, please tell me. Thanks. [image: equations (4)-(6) from the paper]
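For reference, here is a minimal sketch of what a StackGAN-style conditioning-augmentation module (the `ca_net` that the commented-out call points to) usually looks like; the class name and dimensions below are my assumptions, not the repo's actual values:

```python
import torch
import torch.nn as nn

class CANet(nn.Module):
    """StackGAN-style conditioning augmentation (hypothetical sketch):
    encode the input into a Gaussian (mu, logvar) and sample a code
    with the reparameterization trick. Dimensions are illustrative."""

    def __init__(self, text_dim=128, code_dim=128):
        super().__init__()
        self.fc = nn.Linear(text_dim, code_dim * 2)

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        code = mu + std * torch.randn_like(std)  # reparameterization trick
        return code, mu, logvar
```

With a module like this, the bypassed line would read `m_code, m_mu, m_logvar = self.ca_net(motion_input)`; as shipped, the code assigns `motion_input` to all three outputs and skips the sampling entirely.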

Overall very interesting paper. Thanks

litcoderr commented on Feb 14, 2020

The implementation of (4)-(6) is at model.py, lines 307-308. The last step is in layers.py. The code works by first getting the hidden states for all time steps, then generating the images all at once.
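In other words, the structure is roughly the following sketch (the names `gru_cell` and `image_decoder` and the shapes are illustrative, not the repo's exact code):

```python
import torch

def generate_story(gru_cell, image_decoder, inputs, h0):
    """Run the recurrent cell over all time steps first, then decode
    every hidden state into a frame in a single batched call."""
    h = h0
    hiddens = []
    for t in range(inputs.size(1)):        # inputs: (batch, T, dim)
        h = gru_cell(inputs[:, t], h)      # one RNN step per sentence
        hiddens.append(h)
    hiddens = torch.stack(hiddens, dim=1)  # (batch, T, hidden)
    # fold time into the batch so the decoder produces all frames at once
    return image_decoder(hiddens.flatten(0, 1))  # (batch * T, C, H, W)
```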

yitong91 commented on Feb 16, 2020

According to the paper, (4)-(6) seem to be the 'Text2Gist' module, which takes h(t-1) and the GRU output i(t) as inputs. But in your code,

crnn_code = self.motion_content_rnn(motion_input, content_mean)
  • 'motion_content_rnn', which you referred to as the code corresponding to equations (4)-(6), does not take a GRU output of motion_input but the raw motion_input itself.
  • It also takes a mean tensor of the content, not an output from the 'Story Encoder' (VAE). See the sketch after this list.
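For comparison, here is a minimal sketch of the dataflow I read out of equations (3)-(6), where GRU-2 (Text2Gist) consumes the i_t produced by GRU-1 rather than the raw motion_input; the class name and dimension names are hypothetical:

```python
import torch.nn as nn

class PaperStep(nn.Module):
    """One recurrent step as the paper appears to describe it
    (hypothetical sketch, not the repo's code)."""

    def __init__(self, in_dim, i_dim, h_dim):
        super().__init__()
        self.gru1 = nn.GRUCell(in_dim, i_dim)  # eq. (3): [sentence; noise] -> i_t
        self.gru2 = nn.GRUCell(i_dim, h_dim)   # eqs. (4)-(6): Text2Gist update

    def forward(self, sent_noise, i_prev, h_prev):
        i_t = self.gru1(sent_noise, i_prev)  # GRU-1 output, not raw input
        h_t = self.gru2(i_t, h_prev)         # Text2Gist sees i_t, not sentences
        return i_t, h_t
```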

Thx

litcoderr commented on Feb 17, 2020

I have the same question. Has anyone solved it?

awkrail commented on Nov 8, 2020

I am also confused about the RNN implementation part:

There are two GRU cells defined in the code, acting as the two layers of the proposed RNN model: one is a normal GRU (GRU-1) and the other (GRU-2) belongs to the Text2Gist cell.

As described in the paper, GRU-1 takes the concatenated sentence and noise as input and outputs i_t: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L356

The code for GRU-2 in Text2Gist (equations (4)-(6)) is pointed to in https://github.com/yitong91/StoryGAN/issues/15#issuecomment-586750565: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L307-L308

However, the GRU-2 code takes motion_input (the sentences) as input: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L355. I think this is inconsistent with equations (3) and (4) in the paper, where the input to GRU-2 should be i_t (the output of GRU-1).

Also, for equation (7), the input to Filter should be i_t (the output of GRU-1), while in the code it is the output of GRU-2: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L366
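To make that concrete, my reading of equation (7) is roughly the sketch below, with a per-sample filter generated from i_t (GRU-1's output) and convolved with h_t; the 1-D filtering, names, and sizes are all my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GistFilter(nn.Module):
    """Hypothetical sketch of o_t = Filter(i_t) * h_t from equation (7):
    a small filter is generated from i_t and applied to h_t."""

    def __init__(self, i_dim, kernel=3):
        super().__init__()
        self.kernel = kernel
        self.to_filter = nn.Linear(i_dim, kernel)  # one filter per sample

    def forward(self, i_t, h_t):
        # i_t: (batch, i_dim), h_t: (batch, h_dim)
        filt = self.to_filter(i_t).unsqueeze(1)  # (batch, 1, kernel)
        h = h_t.unsqueeze(0)                     # (1, batch, h_dim)
        # grouped conv: each sample's filter acts on its own h_t
        o_t = F.conv1d(h, filt, padding=self.kernel // 2,
                       groups=h_t.size(0))
        return o_t.squeeze(0)                    # (batch, h_dim)
```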

So the above is where my confusion comes from, though I might be wrong (please correct me if so). I also wonder whether the code is intended for the toy data (i-CLEVR for StoryGAN) or is another version of the 'Text2Gist' part?

Any explanations would be appreciated :)

hopeisme commented on Apr 8, 2021