transflower-lightning
Implement Jukebox
Implement a version of Jukebox applied to the task of motion prediction.
We could start with just a single level of the hierarchy, which amounts to implementing a VQ-VAE for a "1-dimensional image": a fixed-size window of motion frames.
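To make the single-level idea concrete, here is a minimal sketch of the quantization step only, assuming an encoder has already turned the motion window into a short sequence of latent vectors. The encoder and decoder (for example, 1D convolutions over time) are left out, and the names `quantize` and `dequantize`, the codebook values, and the sizes are all hypothetical, not taken from this repo.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        # Squared Euclidean distance from z to every codebook vector.
        dists = [sum((a - b) ** 2 for a, b in zip(z, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def dequantize(tokens, codebook):
    """Look up the codebook vector for each token (this is the decoder's input)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy setup: 3 codes of dimension 2, and 4 encoded time steps.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7], [0.6, 0.6]]

tokens = quantize(latents, codebook)
print(tokens)  # [0, 1, 2, 1]
```

The resulting token sequence is what the later autoregressive model would be trained on. A real implementation would also need the codebook and commitment losses and a straight-through gradient estimator, which this sketch omits.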
Then we train an autoregressive transformer to predict the VQ-VAE latent tokens (which encode poses), conditioned on music in the same way as the current multimodal transformer.
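The generation side of this could look roughly like the sampling loop below. It is only a sketch: `next_token_logits` stands in for the trained transformer, `toy_logits` is a made-up placeholder model, and the assumption that there is one music feature per token is for illustration only. How the conditioning is actually wired would follow the existing multimodal transformer.

```python
import math
import random


def sample_tokens(next_token_logits, music_features, num_tokens, seed=0):
    """Sample latent tokens one at a time, each conditioned on the prefix and the music."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(num_tokens):
        logits = next_token_logits(tokens, music_features)
        # Numerically stable softmax weights over the codebook indices.
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        tokens.append(rng.choices(range(len(logits)), weights=weights)[0])
    return tokens


VOCAB_SIZE = 4  # hypothetical codebook size


def toy_logits(prefix, music_features):
    """Placeholder for the transformer: prefers a code derived from the current music feature."""
    step = len(prefix)
    preferred = int(music_features[step]) % VOCAB_SIZE
    return [2.0 if k == preferred else 0.0 for k in range(VOCAB_SIZE)]


music = [0, 1, 2, 3, 0, 1]  # hypothetical per-step music features
sampled = sample_tokens(toy_logits, music, num_tokens=len(music), seed=0)
print(sampled)
```

The sampled tokens would then be passed through the VQ-VAE decoder to recover a window of poses.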
Alternatively, we could use a discrete VAE (dVAE), as in DALL-E, instead of the VQ-VAE.