MusicLM - Pytorch (wip)
Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in Pytorch.
They are basically using a text-conditioned AudioLM, but surprisingly with the embeddings from a text-audio contrastive learned model named MuLan. MuLan is what will be built out in this repository, with AudioLM modified from the other repository to support the music generation needs here.
Please join the LAION community if you are interested in helping out with the replication.
Install
$ pip install musiclm-pytorch
Usage
MuLaN first needs to be trained
import torch
from musiclm_pytorch import MuLaN, AudioSpectrogramTransformer, TextTransformer

audio_transformer = AudioSpectrogramTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64,
    spec_n_fft = 128,
    spec_win_length = 24,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer(
    dim = 512,
    depth = 6,
    heads = 8,
    dim_head = 64
)

mulan = MuLaN(
    audio_transformer = audio_transformer,
    text_transformer = text_transformer
)

# get a ton of <sound, text> pairs and train

wavs = torch.randn(2, 1024)
texts = torch.randint(0, 20000, (2, 256))

loss = mulan(wavs, texts)
loss.backward()

# after much training, you can embed sounds and text into a joint embedding space
# for conditioning the audio LM

embeds = mulan.get_audio_latents(wavs)  # during training
embeds = mulan.get_text_latents(texts)  # during inference
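A minimal sketch of what such a training loop could look like, reusing the mulan instance above. The toy tensors, the Adam optimizer and the learning rate are placeholder assumptions for illustration, not recommendations from the repository - substitute a real dataset of <sound, text> pairs.

import torch
from torch.utils.data import TensorDataset, DataLoader

# toy stand-in data - replace with a real <sound, text> dataset
wav_data  = torch.randn(64, 1024)
text_data = torch.randint(0, 20000, (64, 256))
dataloader = DataLoader(TensorDataset(wav_data, text_data), batch_size = 16, shuffle = True)

# optimizer choice and learning rate are illustrative only
optimizer = torch.optim.Adam(mulan.parameters(), lr = 3e-4)

for epoch in range(10):
    for wavs, texts in dataloader:
        loss = mulan(wavs, texts) # contrastive loss between audio and text embeddings

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()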
To obtain the conditioning embeddings for the three transformers that are a part of AudioLM, you must use the MuLaNEmbedQuantizer, like so
from musiclm_pytorch import MuLaNEmbedQuantizer

wavs = torch.randn(2, 1024)
embeds = mulan.get_audio_latents(wavs)

# setup the quantizer with the namespaced conditioning embeddings, unique per quantizer as well as namespace (per transformer)

quantizer = MuLaNEmbedQuantizer(
    mulan = mulan,
    conditioning_dims = (1024, 1024, 1024), # say all three transformers have model dimensions of 1024
    namespaces = ('semantic', 'coarse', 'fine')
)

# now say you want the conditioning embeddings for semantic transformer

conds = quantizer(wavs = wavs, namespace = 'semantic') # (2, 8, 1024) - 8 is number of quantizers
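The coarse and fine conditioning embeddings come from the same quantizer by switching the namespace - a small usage sketch, with shapes following from the configuration above.

# same quantizer, different namespace per transformer
coarse_conds = quantizer(wavs = wavs, namespace = 'coarse') # (2, 8, 1024)
fine_conds   = quantizer(wavs = wavs, namespace = 'fine')   # (2, 8, 1024)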
Todo
- [x] mulan seems to be using decoupled contrastive learning, offer that as an option
- [x] wrap mulan with mulan wrapper and quantize the output, project to audiolm dimensions
- [ ] modify audiolm to accept conditioning embeddings, optionally take care of different dimensions through a separate projection
- [ ] audiolm and mulan go into musiclm, generate, and filter with mulan
- [ ] add a version of mulan to open clip
- [ ] set all the proper spectrogram hyperparameters
- [ ] email some contrastive learning experts and figure out why some papers are sharing the projection from embeddings to latent space
- [ ] improvise a bit and give the audio transformer a position generating module before each attention layer
Appreciation
- Stability.ai for the generous sponsorship to work and open source cutting edge artificial intelligence research
Citations
@inproceedings{Agostinelli2023MusicLMGM,
    title  = {MusicLM: Generating Music From Text},
    author = {Andrea Agostinelli and Timo I. Denk and Zal{\'a}n Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matthew Sharifi and Neil Zeghidour and C. Frank},
    year   = {2023}
}

@article{Huang2022MuLanAJ,
    title   = {MuLan: A Joint Embedding of Music Audio and Natural Language},
    author  = {Qingqing Huang and Aren Jansen and Joonseok Lee and Ravi Ganti and Judith Yue Li and Daniel P. W. Ellis},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2208.12415}
}
The only truth is music. - Jack Kerouac