StoryGAN
Issue in implementation
Hi. I am interested in multi-modal video generation tasks and came across your paper.
My issues are:
- I was confused by the mismatch between your code and the model described in the paper.
- Your paper describes a 'Story Encoder', which appears to be a VAE module, but it is not used in the code. link to code snippet
m_code, m_mu, m_logvar = motion_input, motion_input, motion_input #self.ca_net(motion_input)
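For reference, here is a minimal sketch of what a conditioning-augmentation (VAE-style) encoder of the kind the paper describes typically looks like in PyTorch. This is not the repo's actual `ca_net`; the class name and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CANet(nn.Module):
    """Sketch of a VAE-style conditioning-augmentation encoder.
    Dimensions are illustrative, not the repo's actual values."""
    def __init__(self, text_dim=128, c_dim=64):
        super().__init__()
        # A single linear layer predicting both mu and logvar
        self.fc = nn.Linear(text_dim, c_dim * 2)

    def forward(self, text_embedding):
        mu, logvar = self.fc(text_embedding).chunk(2, dim=-1)
        std = torch.exp(0.5 * logvar)
        # Reparameterization trick: sample a code while keeping gradients
        code = mu + std * torch.randn_like(std)
        return code, mu, logvar
```

The commented-out `self.ca_net(motion_input)` in the snippet above suggests such a module existed at some point but was bypassed, so `m_code`, `m_mu`, and `m_logvar` all collapse to the raw input.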
- I am having a hard time interpreting the code. For example, what are 'motion features' and 'content features'? They are not specified in the paper or the code. If you have time, please add some documentation.
- Where is this implemented? I might be misunderstanding; if anyone understands the implementation, please let me know. Thx
Overall, a very interesting paper. Thanks
The implementation of (4)-(6) is at model.py lines 307-308. The last step is in layers.py. The code works by first computing the hidden states for all time steps, then generating the images at once.
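The "hidden states for all steps, then images at once" pattern described above can be sketched as follows. This is an illustrative toy, not the repo's actual architecture; the class name, the linear decoder, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class RecurrentGenerator(nn.Module):
    """Sketch: collect GRU hidden states over all time steps,
    then decode every frame in a single batched pass.
    Sizes and the linear 'decoder' are illustrative stand-ins."""
    def __init__(self, in_dim=64, hid_dim=128):
        super().__init__()
        self.gru = nn.GRUCell(in_dim, hid_dim)
        self.decode = nn.Linear(hid_dim, 3 * 8 * 8)  # stand-in for the image decoder

    def forward(self, inputs):  # inputs: (T, B, in_dim)
        T, B, _ = inputs.shape
        h = inputs.new_zeros(B, self.gru.hidden_size)
        states = []
        for t in range(T):
            h = self.gru(inputs[t], h)  # step the recurrence
            states.append(h)
        states = torch.stack(states)                 # (T, B, hid_dim)
        imgs = self.decode(states.view(T * B, -1))   # all frames decoded at once
        return imgs.view(T, B, 3, 8, 8)
```

Folding the time dimension into the batch for the decoder is what lets "the images" be generated in one pass after the recurrence finishes.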
According to the paper, (4)-(6) appear to be the 'Text2Gist' module, which takes h(t-1) and i(t) from the GRU as input. But in your code,
crnn_code = self.motion_content_rnn(motion_input, content_mean)
- 'motion_content_rnn', which you referred to as the code corresponding to equations (4)-(6), does not take the GRU output of motion_input but the raw motion_input.
- It also takes the mean tensor of the content, not an output from the 'Story Encoder' (VAE).
Thx
I have the same question; has anyone solved it?
I am also confused about the RNN implementation part:
There are two GRU cells defined in the code, acting as the two layers of the proposed RNN model: a plain GRU (GRU-1) and another (GRU-2) belonging to the Text2Gist cell.
As mentioned in the paper, GRU-1 takes the concatenated sentence and noise as input and outputs i_t: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L356
The GRU-2 in Text2Gist (equations 4-6) is the code pointed to in https://github.com/yitong91/StoryGAN/issues/15#issuecomment-586750565: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L307-L308
However, the GRU-2 code takes motion_input (the sentences) as input: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L355, which I think is inconsistent with equations (3) and (4) in the paper, where the input to GRU-2 should be i_t (the output of GRU-1).
Also, for equation (7), the input to Filter should be i_t (the output of GRU-1), while in the code it is the output of GRU-2: https://github.com/yitong91/StoryGAN/blob/6172f8a11d80ae5cbcd55234cb490a154cadde0e/code/model.py#L366
That is where my confusion comes from, though I might be wrong (please correct me if I am). I also wonder whether this code is intended only for the toy data (i-CLEVR for StoryGAN), or whether it is another version of the 'Text2Gist' part?
Any explanations would be appreciated :)
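For what it's worth, the data flow as the paper describes it (equations 3-7) can be sketched like this. This is my reading of the paper, not the repo's code; the class name, module choices, and dimensions are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class PaperFlow(nn.Module):
    """Sketch of one recurrent step as described in the paper:
    GRU-1 produces i_t from sentence + noise (Eq. 3), Text2Gist
    (GRU-2 here) consumes i_t and h_{t-1} (Eqs. 4-6), and Filter
    is driven by i_t (Eq. 7). Dimensions are illustrative."""
    def __init__(self, sent_dim=32, noise_dim=8, hid_dim=64):
        super().__init__()
        self.gru1 = nn.GRUCell(sent_dim + noise_dim, hid_dim)
        self.gru2 = nn.GRUCell(hid_dim, hid_dim)   # stand-in for Text2Gist
        self.filt = nn.Linear(hid_dim, hid_dim)    # stand-in for Filter(i_t)

    def step(self, sent, noise, g_prev, h_prev):
        # Eq. (3): GRU-1 on concatenated sentence and noise
        i_t = self.gru1(torch.cat([sent, noise], dim=-1), g_prev)
        # Eqs. (4)-(6): Text2Gist fed by i_t, not the raw sentence
        h_t = self.gru2(i_t, h_prev)
        # Eq. (7): Filter driven by i_t, not by the GRU-2 output
        f_t = self.filt(i_t)
        return i_t, h_t, f_t
```

The point of the sketch is the wiring: in the paper, both GRU-2 and Filter consume i_t, whereas in the linked code GRU-2 receives motion_input directly and Filter receives the GRU-2 output.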