MARS5-TTS
Long-form generation
I have implemented this simple method to generate long-form content with MARS5. It splits the text into multiple chunks, generates audio for each chunk individually, and then joins the results. There are two ways this can work: (1) it can reuse the reference provided by the user (`sliding_window_reuse_reference = True`), or (2) it can use the audio generated for the previous chunk as the reference (`sliding_window_reuse_reference = False`).
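Here is a rough sketch of that loop, just to make the two modes concrete. The names `split_into_chunks` (one possible implementation is sketched under "Sliding window size" below) and the exact `mars5.tts(text, ref_audio, ref_transcript, cfg=cfg)` call are assumptions for illustration, not the exact implementation.

```python
import torch

def generate_long_form(mars5, text, ref_audio, ref_transcript, cfg):
    # Split the input into character-budgeted chunks (see splitter sketch below).
    chunks = split_into_chunks(text, cfg.sliding_window_size)
    pieces = []
    prev_audio, prev_transcript = ref_audio, ref_transcript
    for chunk in chunks:
        if cfg.sliding_window_reuse_reference:
            # Mode (1): always condition on the user-provided reference.
            _, wav = mars5.tts(chunk, ref_audio, ref_transcript, cfg=cfg)
        else:
            # Mode (2): condition on the audio generated for the previous chunk,
            # using that chunk's text as its transcript.
            _, wav = mars5.tts(chunk, prev_audio, prev_transcript, cfg=cfg)
            prev_audio, prev_transcript = wav, chunk
        pieces.append(wav)
    # Concatenate the per-chunk waveforms into one long-form output.
    return torch.cat(pieces)
```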
Pros of reusing the same reference:
- It is more robust: if generation fails in one chunk, it does not affect the other chunks.
- A short reference is sufficient, so inference is faster and you can use longer sliding windows (meaning fewer splits).
Cons of reusing the same reference:
- The speech is less fluent. For example, if the reference is a sentence, each generated chunk can carry an accent at the start of the speech (since the model expects to be generating the next sentence at that point). This is barely noticeable, but in the examples below, it is possible to hear that the reuse sample puts an accent on some words.
Examples
The chunks were as follows:
- An advantage of variance as a measure of dispersion is that it is more amenable to algebraic manipulation than other measures
- of dispersion such as the expected absolute deviation; for example, the variance of a sum of uncorrelated random variables
- is equal to the sum of their variances.
- A disadvantage of the variance for practical applications is that, unlike the standard deviation, its units differ from the
- random variable, which is why the standard deviation is more commonly reported as a measure of dispersion once the calculation
- is finished.
Using the previous chunk as reference: https://github.com/Camb-ai/MARS5-TTS/assets/9572985/f1675439-2865-44b1-834c-a2b82365644e

Reusing the original reference: https://github.com/Camb-ai/MARS5-TTS/assets/9572985/695ed185-a74a-405f-89ff-d016a768eb22
Sliding window size
I added the size of the sliding window as a `cfg` attribute. The splitter simply counts the characters in the input and splits the text so that no chunk exceeds that budget. This can be controlled by the user.
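A minimal sketch of such a character-budgeted splitter, packing words greedily up to the limit; the attribute name `sliding_window_size` is the one assumed above:

```python
def split_into_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack whitespace-separated words so no chunk exceeds max_chars."""
    chunks, current, length = [], [], 0
    for word in text.split():
        # +1 accounts for the joining space before this word.
        if current and length + 1 + len(word) > max_chars:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + (1 if length > len(word) else 0) if False else len(word) + (1 if current[:-1] else 0)
    if current:
        chunks.append(" ".join(current))
    return chunks

# Example: split_into_chunks(text, cfg.sliding_window_size)
```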
Silences
I have lowered the `trim_db` attribute to trim more aggressively. There are, however, still some silences generated in the middle of the speech. On the other hand, when two chunks are joined, they often follow each other abruptly, and it would be nice to insert some additional silence there. I think a good sound engineer might be able to fix both of these issues.
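For the abrupt joins, one simple option is to insert a short fixed gap when concatenating. A minimal sketch, assuming the chunks come back as NumPy arrays at a known sample rate; the 0.25 s gap is an arbitrary example value, and a short crossfade would work similarly:

```python
import numpy as np

def join_with_silence(chunks: list[np.ndarray], sr: int, gap_s: float = 0.25) -> np.ndarray:
    """Concatenate audio chunks, inserting gap_s seconds of silence between them."""
    gap = np.zeros(int(sr * gap_s), dtype=chunks[0].dtype)
    out = [chunks[0]]
    for c in chunks[1:]:
        out.extend([gap, c])
    return np.concatenate(out)
```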
One way you could make the generation process consistent between chunks is by forcing a part of the previous chunk T onto the next chunk T+1 in the diffusion stage. Suppose the chunks have length 50: at inference time you save the last 10 frames (an arbitrary number for the sake of example) at each step of the diffusion process. Then, when running diffusion on chunk T+1, you overlap it with chunk T by those 10 frames, and at each diffusion step you force the corresponding diffusion outputs from chunk T onto the overlapping region. In this way you force the model to produce continuous speech between chunks by providing context from the previous one.
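Roughly, in pseudo-real Python; everything here (`denoise_step`, the latent shape, the step caching) is assumed for illustration and does not reflect MARS5's actual diffusion internals, only the overlap-forcing idea:

```python
import torch

def diffuse_with_overlap(denoise_step, x_T, cached_steps, overlap: int, num_steps: int):
    """Denoise chunk T+1 while forcing its first `overlap` frames to match
    the last `overlap` frames of chunk T at every diffusion step.

    cached_steps[t] holds chunk T's intermediate latent at step t
    (None for the very first chunk).
    """
    x = x_T
    saved = []  # cache this chunk's trajectory for the *next* chunk
    for t in range(num_steps):
        if cached_steps is not None:
            # Overwrite the overlapping region with chunk T's trajectory, so the
            # model denoises the fresh frames in a context that matches chunk T.
            x[..., :overlap] = cached_steps[t][..., -overlap:]
        x = denoise_step(x, t)
        saved.append(x.detach().clone())
    return x, saved
```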
@Craq That is what is happening when `sliding_window_reuse_reference = False`: the previous chunk is reused as the reference. I use the entire chunk, since we need to know its transcript as well. We could use just the last X frames, but we would then have to match the transcript accordingly (not trivial).
I tried this method and got more consistent results.