motion-diffusion-model
We don't have to call encode_text at every denoising timestep (we could compute the text embedding once at the beginning), do we?
https://github.com/GuyTevet/motion-diffusion-model/blob/8139dda55d90a58aa5a257ebf159b2ecfb78c632/model/mdm.py#L151C8-L151C8
```python
class MDM(nn.Module):
    ...

    def forward(self, x, timesteps, y=None):
        """
        x: [batch_size, njoints, nfeats, max_frames], denoted x_t in the paper
        timesteps: [batch_size] (int)
        """
        bs, njoints, nfeats, nframes = x.shape
        emb = self.embed_timestep(timesteps)  # [1, bs, d]

        force_mask = y.get('uncond', False)
        if 'text' in self.cond_mode:
            enc_text = self.encode_text(y['text'])
            emb += self.embed_text(self.mask_cond(enc_text, force_mask=force_mask))
```
Yes, that's possible at inference (but not in training) and can accelerate performance. If you are interested, you can send us a pull request.
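For reference, the caching idea can be sketched as a simple memoizing wrapper around the text encoder (all names here are hypothetical, not from the repo): since the prompt does not change across the denoising loop at inference, the embedding can be computed once and reused at every timestep.

```python
# Hypothetical sketch: memoize the text encoder so repeated calls with the
# same prompt (one per denoising step) hit a cache instead of re-encoding.

def make_cached_encoder(encode_text):
    cache = {}
    calls = {"n": 0}  # count how many real encodes actually happen

    def cached(texts):
        key = tuple(texts)  # prompts are hashable strings
        if key not in cache:
            calls["n"] += 1
            cache[key] = encode_text(texts)
        return cache[key]

    cached.calls = calls
    return cached

# Dummy stand-in for the real (e.g. CLIP-based) text encoder.
def dummy_encode(texts):
    return [hash(t) % 997 for t in texts]

encode = make_cached_encoder(dummy_encode)
prompts = ["a person walks forward"]
embs = [encode(prompts) for _ in range(50)]  # simulate 50 denoising steps
assert encode.calls["n"] == 1  # encoded once, reused for the other 49 steps
```

This only changes how often the encoder runs, not the result, so sampling output is unaffected as long as the encoder is deterministic at inference.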
@GuyTevet I sent a PR here: https://github.com/GuyTevet/motion-diffusion-model/pull/152/
Why is it not possible during training?