Are there plans to train a 44.1kHz stereo model?
If not, what architectural changes would you suggest to be able to do so?
Would I need to scale the model size? What scale do you think would provide indistinguishable results?
Not quite what you're asking, but FYI: I've added stereo simulation and 44.1 kHz conversion to the Colab notebook. If you set stereo_width > 0, it generates a second audio file by running style transfer on the first generated audio with the same prompt at low strength, then mixes the two into the left and right channels according to the set stereo width. Conversion to 44.1 kHz uses ffmpeg's default resampling filter (afaik -dither_scale 1 -resampler 'swr' -filter_size 32 -phase_shift 10), which usually makes the audio sound a tiny bit better. These little tricks don't do much, but I find the results much more enjoyable than 16 kHz mono.
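For anyone curious what the stereo mixing step might look like, here is a minimal sketch. It is not the notebook's actual code: the function names `mix_stereo` and `resample_command`, the linear blending rule, and the [0, 1] range for the width are all assumptions made for illustration. The only ffmpeg option used is `-ar`, which sets the output sample rate.

```python
from typing import List, Sequence, Tuple


def mix_stereo(
    a: Sequence[float], b: Sequence[float], width: float
) -> Tuple[List[float], List[float]]:
    """Blend two mono takes into left/right channels (hypothetical helper).

    width = 0.0 -> both channels carry the average of a and b (mono).
    width = 1.0 -> left is take a, right is take b (fully separated).
    """
    if not 0.0 <= width <= 1.0:
        raise ValueError("width must be in [0, 1]")
    n = min(len(a), len(b))      # truncate to the shorter take
    near = (1.0 + width) / 2.0   # weight of a channel's "own" take
    far = (1.0 - width) / 2.0    # weight of the other take
    left = [near * a[i] + far * b[i] for i in range(n)]
    right = [far * a[i] + near * b[i] for i in range(n)]
    return left, right


def resample_command(src: str, dst: str, rate: int = 44100) -> List[str]:
    """Build (but do not run) an ffmpeg command that resamples src to rate Hz."""
    return ["ffmpeg", "-y", "-i", src, "-ar", str(rate), dst]
```

With this rule, a width of 0.5 gives each channel 75% of its own take and 25% of the other. The list returned by `resample_command` can be passed to `subprocess.run` if ffmpeg is installed.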
UPDATE: I've added some more post-processing for audio quality enhancement to the Colab notebook.
https://github.com/haoheliu/AudioLDM/assets/50331907/5ecdd632-34e2-4f13-9681-2f8e3aa28777
This is a great question. I think 44.1 kHz (even mono) would allow immediate real-world deployment.