jukebox
How do you input the accurate lyrics alignment obtained by "NUS AutoLyricsAlign" into Jukebox?
The approach is described at https://openai.com/blog/jukebox/ as follows:
"we use Spleeter32 to extract vocals from each song and run NUS AutoLyricsAlign33 on the extracted vocals to obtain precise word-level alignments of the lyrics. We chose a large enough window so that the actual lyrics have a high probability of being inside the window."
However, when I look at the source code on GitHub, it appears that lyrics are fed into Jukebox using heuristics, and the process described above does not seem to be implemented.
Could you please release the source code that inputs the lyrics exactly as described above?
I am also curious about this.
I'm staring at this right now. Spleeter and AutoLyricsAlign would need to be run as a preprocessing step on the dataset.
Spleeter can run from a docker container.
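For reference, Spleeter also exposes a Python API, so the vocal-extraction pass can be scripted rather than going through docker. A minimal sketch, with placeholder paths:

```python
# Minimal Spleeter preprocessing sketch: extract vocals from one track.
# Paths are placeholders; "2stems" splits into vocals + accompaniment.
from spleeter.separator import Separator

separator = Separator('spleeter:2stems')
# writes vocals.wav and accompaniment.wav under /path/to/out_dir/<track_name>/
separator.separate_to_file('/path/to/track.wav', '/path/to/out_dir')
```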
AutoLyricsAlign runs with "apptainer" instead. Its scripts are finicky and will refuse to work if you have whitespace in your filenames, etc., but nonetheless, you can end up with a text file full of this sort of thing:
36.06 36.45 PUT
36.45 36.81 YOUR
36.81 37.44 WHITE
37.44 38.49 TENNIS
38.49 39.03 SHOES
39.03 39.39 ON
39.39 39.63 AND
39.63 40.41 FOLLOW
40.41 41.13 ME
And at that point, it's a matter of mapping those timestamps to whatever shape jukebox consumes.
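Parsing that output is trivial; a quick sketch (the filename and the exact three-column format are just what I got out of my run):

```python
# parse AutoLyricsAlign output lines of the form "<start_seconds> <end_seconds> <WORD>"
def parse_alignment(path):
    words = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) != 3:
                continue  # skip blank/malformed lines
            start, end, word = parts
            words.append((float(start), float(end), word))
    return words

# parse_alignment("01-born_to_die.txt")[0] -> (36.06, 36.45, 'PUT')
```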
I'm still fuzzy on what the code here does, but the y vector returned by labeller::get_label is probably relevant. Maybe tweaking get_relevant_lyric_tokens could be enough, or maybe we'd need a variant of the Labeller class in data/labels.py.
I think one way to do this is to have a script that merges the output of AutoLyricsAlign with the reference lyrics, inserting timestamps into them in a way that can coexist peacefully with the lyrics. Something like:
{36.12}Put {36.45}your {36.81}white {37.44}tennis {38.49}shoes {39.03}on {39.42}and {39.60}follow {40.44}me
{41.97}Why {42.60}work {43.26}so {43.65}hard {44.49}when {45.00}you {45.06}could {45.54}just {45.90}be {46.14}free?
The merge process needs to be fuzzy, as AutoLyricsAlign isn't perfect: it transcribes unrecognized words as "BREATH*", misrecognizes words, may have differing opinions about how to represent words ("seven" vs "7", "hopping" vs "hoppin'", etc.), may miss background vocals (Spleeter shares some blame there), and occasionally miscounts words.
I initially naively assumed that since it is given reference lyrics to work with, it'd optimistically place them in its output to complement its own transcription, but that's not the case.
(I looked into replacing it with Whisper, but even with the newer word-level transcription bits, it's not a reliable drop-in replacement. Still, I suspect it could become the better path with more work.)
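Here's a rough, untested sketch of what that fuzzy merge could look like, using difflib to line the aligned all-caps words up against the reference lyrics; the normalization and matching heuristics here are just one way to do it:

```python
import difflib
import re

def normalize(word):
    # collapse case/punctuation so "Hoppin'" and "HOPPIN" compare equal
    return re.sub(r"[^a-z0-9]", "", word.lower())

def merge_alignment(reference_words, aligned_words):
    # reference_words: the reference lyrics as an ordered list of words
    # aligned_words: list of (start, end, WORD) tuples from AutoLyricsAlign
    ref_norm = [normalize(w) for w in reference_words]
    ali_norm = [normalize(w) for _, _, w in aligned_words]
    out = list(reference_words)
    matcher = difflib.SequenceMatcher(None, ref_norm, ali_norm, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            start = aligned_words[block.b + k][0]
            out[block.a + k] = "{%.2f}%s" % (start, reference_words[block.a + k])
    return " ".join(out)  # words that found no match keep no timestamp
```

(This flattens line breaks for brevity; a real version would preserve the original lyric layout.)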
The rationale for going through that hacky merge step rests on the assumption that the original models were trained on reference lyrics rather than all-caps, punctuation-free lyrics, and therefore that fine-tuning them using the full range of lyrics tokens, rather than collapsing them to [A-Z ], should yield better results.
Then we'd need to add methods in TextProcessor to consume that. Something similar to clean() would parse out the timings and associated text and store them in a dictionary, while something similar to tokenize() would take track-length information and return a large zeroed array matching the total number of tokens for the track, sparsely filled with each bit of lyrics. The labeller could then use that array instead of tokenize()'s return value.
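Roughly, something like this (standalone sketch, not the real TextProcessor; the names and the character-to-token mapping are placeholders):

```python
import re
import numpy as np

TIMED_WORD = re.compile(r"\{([0-9.]+)\}(\S+)")

def parse_timed_lyrics(text):
    # clean()-like step: pull {timestamp}word pairs out into a dict
    return {float(t): w for t, w in TIMED_WORD.findall(text)}

def tokenize_timed(timed_lyrics, track_duration, full_size, char_to_token):
    # tokenize()-like step: a zeroed array covering the whole track,
    # sparsely filled with each word's tokens at its starting position
    tokens = np.zeros(full_size, dtype=np.int64)
    for start, word in sorted(timed_lyrics.items()):
        pos = int(start / track_duration * full_size)
        for i, ch in enumerate(word + " "):
            if pos + i < full_size:
                tokens[pos + i] = char_to_token.get(ch, 0)
    return tokens
```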
At that point, in theory (I haven't gotten that far yet), get_relevant_lyric_tokens could be left unchanged and would amount to picking a sliding window of lyrics tokens matching the given offset at 1:1 scale.
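In that view, the window selection boils down to simple arithmetic (simplified sketch of the idea, not the actual jukebox code):

```python
def pick_window(full_tokens, total_length, offset, n_tokens):
    # At 1:1 scale, the chunk starting at `offset` samples maps to a
    # proportional position in the full-track token array. With the numbers
    # in the log below, 262144 / 12623772 * 18492 ~= 384 tokens, i.e. exactly n_tokens.
    full_size = len(full_tokens)
    start = int(offset * full_size / total_length)
    return full_tokens[start:start + n_tokens]
```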
Things seem to be working as described above. Here's what a ~6 second slice obtained from get_relevant_lyric_tokens looks like:
n_tokens=384 total_length=12623772 sample_length=262144 full_size=18492
get_relevant_lyric_tokens return a window of [23.777233560090703 => 29.72154195011338]
CHUNK (0,1024*1024) is: file=/home/itsnotlupus/datasets/audio/lana/wav/01.-Born To Die.wav
y=[12623772 1048576 262144 1793 14 0 0 0
0 0 0 0 0 6 31 31
46 77 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 30
41 40 71 46 77 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
32 27 35 38 77 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 39 31 77 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
40 41 49 79 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 20 27
37 31 77 0 0 0 0 0
0 0 0 0 0]
describe_label(y)={'artist': 'lana_del_rey', 'genre': 'pop', 'lyrics': "Feet don't fail me now\nTake "}
Each word of those lyrics is inserted at the correct starting spot in that vector.
This doesn't account for the time it takes to utter each word (i.e. I don't insert zeroes in between the characters of a given word), but without knowing how the lyrics were shaped when the original training was done, it's a plausible guess.
I'm going to fine-tune this for a while and see what I get.
You're doing god's work.