SOME
transcription
Can you please give an example of a transcriptions.csv file with name, ph_seq, ph_dur and ph_num in it?
I want to see a reference file.
If you have ever made DiffSinger datasets, you should be familiar with transcriptions.csv. If you haven't done that before and want to learn more details, see https://github.com/openvpi/MakeDiffSinger. There is also a link to this SOME repository in https://github.com/openvpi/MakeDiffSinger/tree/main/variance-temp-solution, and you will understand everything once you reach that step.
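For a concrete reference, here is a minimal made-up sketch of what such a file could look like, assuming the column layout used in the variance-temp-solution (name, ph_seq, ph_dur, ph_num); the phonemes, durations (in seconds) and grouping below are invented for illustration rather than taken from a real dataset:

```csv
name,ph_seq,ph_dur,ph_num
2001000001,SP n i SP h ao SP,0.25 0.12 0.35 0.08 0.14 0.52 0.30,1 2 1 2 1
```

ph_seq and ph_dur each have one space-separated entry per phoneme, while the entries of ph_num describe how the phonemes are grouped into words/notes and sum to the total phoneme count (1 + 2 + 1 + 2 + 1 = 7 here).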
1- Does this variance-temp-solution link work for English or French datasets?
Ok, thanks.
So if I understand this correctly, if I have ph_seq, ph_dur and ph_num, I can use SOME to get the MIDI sequence and MIDI duration sequence? If yes, I have 2 questions:
1- How can I obtain those three (ph_seq, ph_dur, ph_num)? I saw 2 tools, but I'm not sure whether they produce all three:
https://github.com/wolfgitpr/LyricFA
https://github.com/Anjiurine/fast-phasr-next
Is there any other tool that will automatically generate the phoneme sequence, phoneme duration sequence and phoneme num for me?
2- How accurate will the generated MIDI sequence and MIDI duration sequence be? Like 100%? (I'm asking because, if it isn't 100%, I think it will make the model hallucinate during SVS inference.)
- ph_seq and ph_dur should be obtained when you finish making your DiffSinger acoustic dataset; many tools and pipelines can do this. But as far as I know, ph_num can only be obtained by the method described in the MakeDiffSinger repository, and unfortunately there is no proper method of automatic ph_num inference for polysyllabic languages like English and French yet. However, I already have an idea for doing this, as described in https://github.com/openvpi/MakeDiffSinger/issues/11. If you have suggestions, you can comment on that issue.
- The pretrained model of SOME is trained on pure Chinese datasets. Though SOME is language-irrelevant, it may not produce results as good as on its "native" language. But we do benefit from it in reducing the time cost of manual MIDI labeling, because of its ability to recognize slur notes and generate cent-level MIDI values.
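To make the relationship between the three columns concrete, here is a small sketch in plain Python (the file path is hypothetical) that checks the invariants a transcriptions.csv of this kind should satisfy: one duration per phoneme, and ph_num entries that sum to the total number of phonemes.

```python
import csv

# Hypothetical path; point this at your own transcriptions.csv.
CSV_PATH = "transcriptions.csv"

with open(CSV_PATH, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        ph_seq = row["ph_seq"].split()                       # phoneme sequence
        ph_dur = [float(x) for x in row["ph_dur"].split()]   # durations in seconds
        ph_num = [int(x) for x in row["ph_num"].split()]     # phonemes per word/note group

        # One duration per phoneme.
        assert len(ph_seq) == len(ph_dur), row["name"]
        # The groups listed in ph_num must cover exactly all phonemes.
        assert sum(ph_num) == len(ph_seq), row["name"]
```

If either assertion fails, the labels are inconsistent and will cause problems later in the pipeline, whichever tool produced them.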
Does this help? https://github.com/colstone/ENG_dur_num
Yes, this can help to some degree. But I doubt whether simply specifying all vowels is enough and proper for polysyllabic languages. A more detailed discussion was raised here: https://github.com/openvpi/MakeDiffSinger/discussions/12
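To illustrate where the "just list all vowels" heuristic gets shaky, here is a deliberately naive Python sketch of vowel-based ph_num grouping; the VOWELS set is hypothetical and would have to come from the actual phoneme dictionary of the language in question.

```python
# Hypothetical vowel inventory; in practice this must match the dictionary in use.
VOWELS = {"aa", "ae", "ah", "ao", "eh", "er", "ih", "iy", "ow", "uh", "uw"}
BREAKS = {"SP", "AP"}  # silence / breath markers form their own single-phoneme groups


def naive_ph_num(ph_seq):
    """One group per vowel: consonants attach to the vowel that follows them,
    and leftover consonants before a break or at the end of the sequence are
    lumped onto the previous group."""
    groups = []   # completed group sizes
    pending = 0   # consonants not yet assigned to a vowel

    def flush_to_previous():
        nonlocal pending
        if pending:
            if groups:
                groups[-1] += pending
            else:
                groups.append(pending)  # degenerate case: no vowel seen yet
            pending = 0

    for ph in ph_seq:
        if ph in BREAKS:
            flush_to_previous()
            groups.append(1)
        elif ph in VOWELS:
            groups.append(pending + 1)  # onset consonants + this vowel
            pending = 0
        else:
            pending += 1                # consonant: wait for the next vowel
    flush_to_previous()
    return groups


print(naive_ph_num("SP w ah n SP".split()))  # -> [1, 3, 1]
```

The ambiguity shows up as soon as a consonant sits between two vowels: this rule always pushes it onto the following vowel, but whether it really belongs to the coda of the previous syllable or the onset of the next one depends on the word, which is exactly the problem for polysyllabic languages.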
I think an examples/transcription.csv is a no-brainer...
The format itself seems to vary based on the method and pipeline.