Question on Figure

Quick question on this figure in the blog post: I know Coconet is its own model that will generate subsequent melodies given the input MIDI file. However, if I decide to train MIDI-DDSP, will training Coconet also be part of this? Or should I expect a monophonic MIDI melody as input and the generated audio as output?
Thanks for all the help and this awesome project!
Hi! Thanks for your interest! Yes, the latter: MIDI-DDSP takes a monophonic MIDI melody as input and generates the audio as output; training Coconet is not part of training MIDI-DDSP.
Thank you so much for your prompt response. For training, should the output be the same melody as the MIDI input? Meaning, if I want to train on a new instrument, I need the MIDI transcription?
Yes. You need paired MIDI and audio data to train MIDI-DDSP. MIDI-DDSP currently does not support training on datasets other than URMP, so you may need some hacks to do so. Lastly, the audio-MIDI alignment quality will affect the generation quality of MIDI-DDSP, as the extraction of note expression relies on the note boundaries.
I see. Thank you!
How "accurate"/reliable was URMP in alignment quality? Also, do you use certain metrics used to measure and assess alignment quality?
I don't have a metric for alignment quality, but the MIDI (note boundaries) in the URMP dataset is manually labeled. So I manually checked the MIDI alignment against the audio, and empirically I found the URMP dataset has very good alignment quality.
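If you want a rough automated sanity check, one sketch (just my suggestion here, not something the repo provides) is to compare detected audio onsets against the MIDI note starts:

```python
# Rough alignment sanity check (sketch only, not part of the repo).
import librosa
import numpy as np
import pretty_midi

def onset_alignment_error(audio_path, midi_path, sr=16000):
    """Mean |detected audio onset - nearest MIDI note start| in seconds."""
    audio, _ = librosa.load(audio_path, sr=sr)
    audio_onsets = librosa.onset.onset_detect(y=audio, sr=sr, units='time')
    midi_starts = np.array(
        [n.start for n in pretty_midi.PrettyMIDI(midi_path).instruments[0].notes])
    # For each detected audio onset, distance to the closest MIDI note start.
    errors = [np.min(np.abs(midi_starts - t)) for t in audio_onsets]
    return float(np.mean(errors)) if errors else float('nan')
```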
Thanks for all of your help. I would love to help out and improve the repository in any way I can. How difficult do you think it would be to allow training on arbitrary datasets?
Well... I have to confess that this codebase is not well written (by myself), so you will need some hacks. Here are the steps you should take:
- Write data preprocessing code or a dataloader for the synthesis generator: write code that transforms (MIDI files + audio files) -> tfrecord, with the same keys and values as here (see the first sketch below). Note that there are two types of dataset: one is "batched", meaning the data is chunked into 4-second samples; the other is "unbatched", meaning there is one sample per audio recording. Alternatively, you could write your own dataloader.
- Once your tfrecord is in the same format as URMP's, the dataset dump code and dataloader for the expression generator should work fine; otherwise you will need to hack those and come up with your own dataloader and dataset dump.
- You need to come up with a way to scale the note expression controls so that they fall approximately in [0, 1]. Here I came up with my own scaling coefficients; you can simply scale them to have unit variance instead (see the second sketch below). But be aware that you should not clip the values.
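A minimal sketch of the first step, assuming 16 kHz audio, a 250 Hz frame rate, and 4-second chunks for the "batched" dataset. These numbers and the feature keys below are my own placeholders for illustration; the real keys and values must match the URMP tfrecord format referenced above:

```python
# Illustration only: dump (MIDI notes + audio) into a "batched" tfrecord.
# Feature keys ('audio', 'midi_pitch', 'onsets') are placeholders -- use the
# exact keys/values from the URMP tfrecord format instead.
import numpy as np
import tensorflow as tf

SAMPLE_RATE = 16000   # assumed audio sample rate
FRAME_RATE = 250      # assumed frame rate for frame-wise features
CHUNK_SECONDS = 4     # "batched" dataset: one example per 4 s chunk

def _float_feature(values):
    return tf.train.Feature(
        float_list=tf.train.FloatList(value=np.asarray(values, np.float32).ravel()))

def notes_to_frames(notes, num_frames):
    """Rasterize (pitch, start_sec, end_sec) notes into frame-wise pitch/onset arrays."""
    pitch = np.zeros(num_frames, np.float32)
    onset = np.zeros(num_frames, np.float32)
    for p, start, end in notes:
        s, e = int(start * FRAME_RATE), int(end * FRAME_RATE)
        pitch[s:e] = p
        onset[min(s, num_frames - 1)] = 1.0
    return pitch, onset

def write_batched_tfrecord(path, audio, notes):
    """Chunk one recording into 4 s examples and write them to a tfrecord file."""
    chunk_samples = SAMPLE_RATE * CHUNK_SECONDS
    chunk_frames = FRAME_RATE * CHUNK_SECONDS
    pitch, onset = notes_to_frames(notes, len(audio) * FRAME_RATE // SAMPLE_RATE)
    with tf.io.TFRecordWriter(path) as writer:
        for i in range(len(audio) // chunk_samples):
            feature = {
                'audio': _float_feature(audio[i * chunk_samples:(i + 1) * chunk_samples]),
                'midi_pitch': _float_feature(pitch[i * chunk_frames:(i + 1) * chunk_frames]),
                'onsets': _float_feature(onset[i * chunk_frames:(i + 1) * chunk_frames]),
            }
            writer.write(tf.train.Example(
                features=tf.train.Features(feature=feature)).SerializeToString())
```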
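And a minimal sketch of the scaling step, assuming the raw note-expression values come as a [num_notes, num_controls] array. Min-max scaling here is my stand-in for the hand-tuned coefficients; dividing by the per-control standard deviation instead gives unit variance:

```python
import numpy as np

def fit_minmax_scaling(expression, eps=1e-7):
    """Fit per-control scaling so values land roughly in [0, 1].

    `expression`: [num_notes, num_controls] raw note-expression values
    extracted from the training set.
    """
    lo = expression.min(axis=0)
    scale = 1.0 / np.maximum(expression.max(axis=0) - lo, eps)
    return lo, scale

def apply_scaling(expression, lo, scale):
    # Important: do NOT clip -- values may fall slightly outside [0, 1].
    return (expression - lo) * scale
```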
If all of the above works, then I believe it can run on arbitrary datasets. This is on my todo list, but I don't have the bandwidth to do it :(. Good luck with that!