Feature Request: Rate-independent audio search/match
Howdy! I'm the developer of Storyteller, which is a platform for automatically syncing ebooks and audiobooks. I'm interested in migrating to echogarden for the narration syncing, but I was wondering if anyone would be able to talk through some points with me, first.
I guess, first, some context. Because ebooks and audiobooks don't always have exactly the same content, and their contents aren't always in the same order (e.g. meta content, such as dedications, is often at the front of the ebook but at the end of the audiobook), Storyteller has a multi-phase syncing process:
- Transcribe the entire book to text (using Whisper)
- For each text chapter, use fuzzy matching to find the place in the transcript where it starts
- Starting at that offset, use a custom windowed fuzzy matching algorithm to attempt to find the start and end timestamps of each sentence in the chapter.
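To give a rough idea of the fuzzy matching in the second step, it's conceptually something like this (a simplified illustration, not Storyteller's actual code; the token-overlap scoring and the ~50-word query size are arbitrary choices):

```ts
// Simplified illustration: slide a window of transcript words over the
// chapter's opening words and keep the offset with the best token-overlap score.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[\p{L}\p{N}']+/gu) ?? [];
}

// Fraction of query tokens that also appear in the window (a crude similarity).
function overlapScore(query: string[], window: string[]): number {
  const windowSet = new Set(window);
  const hits = query.filter(token => windowSet.has(token)).length;
  return hits / query.length;
}

// Find the transcript word offset where a chapter most likely starts.
function findChapterStart(chapterText: string, transcriptWords: string[]): number {
  const query = tokenize(chapterText).slice(0, 50); // compare against the chapter's first ~50 words
  let bestOffset = 0;
  let bestScore = -Infinity;

  for (let offset = 0; offset + query.length <= transcriptWords.length; offset++) {
    const window = transcriptWords.slice(offset, offset + query.length);
    const score = overlapScore(query, window);

    if (score > bestScore) {
      bestScore = score;
      bestOffset = offset;
    }
  }

  return bestOffset;
}
```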
In an ideal world, we would be able to simply treat the ebook text as the "transcript" and use echogarden's align function to find the start and end timestamps for each sentence, but I think the ordering issue mentioned above precludes this approach. I would still be happy to switch to echogarden for just the transcription step, but I figured I might reach out first and see if anyone had any ideas about how I might be able to use echogarden's alignment functionality to replace the entire sync process, including the chapter ordering step.
Either way, thank you for this project, it's going to be a lifesaver for me!
If the default DTW-based alignment engine is applied to the entire text and audio, then it's likely to get confused by areas of text that are not included in the audio at all (and vice versa).
The way it works is that it synthesizes the text using eSpeak and then takes the two audio sequences and tries to map, for every synthesized audio frame (encoded as its first 13 Mel Frequency Cepstral Coefficients), one or more source audio frames.
So, if some range of the synthesized audio doesn't have any equivalent in the source audio, it will still match it to a range of samples, although due to the dissimilarity it is likely that it would match it to a comparatively short time range, and then quickly advance to better matching ranges.
The Dynamic Time Warping algorithm is a form of sequence alignment algorithm. It is designed for temporal sequences that are assumed to have similar content, but with different rates, and where the rate can internally vary to become higher or lower (different parts of the time series are "stretched" or "contracted" relative to each other).
The algorithm is very fast since it doesn't use any speech recognition, or actually any machine-learning algorithm at all.
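As a rough sketch of the core computation (a simplification for illustration, not the actual echogarden implementation), DTW over two MFCC frame sequences looks conceptually like this:

```ts
type Frame = number[]; // e.g. the first 13 MFCCs of one audio frame

function frameDistance(a: Frame, b: Frame): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Classic DTW: the first frames of both sequences are anchored together, and so
// are the last frames. Returns the accumulated cost of the best warping path.
function dtwCost(synthesized: Frame[], source: Frame[]): number {
  const n = synthesized.length;
  const m = source.length;
  // acc[i][j] = cost of the best path aligning synthesized[0..i] with source[0..j]
  const acc = Array.from({ length: n }, () => new Array<number>(m).fill(Infinity));

  for (let i = 0; i < n; i++) {
    for (let j = 0; j < m; j++) {
      const d = frameDistance(synthesized[i], source[j]);

      if (i === 0 && j === 0) {
        acc[i][j] = d;
      } else {
        acc[i][j] = d + Math.min(
          i > 0 ? acc[i - 1][j] : Infinity,               // advance synthesized only
          j > 0 ? acc[i][j - 1] : Infinity,               // advance source only
          i > 0 && j > 0 ? acc[i - 1][j - 1] : Infinity   // advance both
        );
      }
    }
  }

  // Bottom-right cell: both sequences fully consumed. The actual frame-to-frame
  // mapping is recovered by backtracking from here to the top-left cell.
  return acc[n - 1][m - 1];
}
```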
The other alignment engines, dtw-ra and whisper, also primarily work on the assumption that the text and audio contain the same content, although they do use speech recognition and are intended for more difficult audio where there are elements like loud background noises or music.
The whisper alignment engine (using a kind of forced decoding over the transcript) is very robust to music and other noises, and can be used for Karaoke alignment. It can actually skip ranges of non-speech (for example a short instrumental part), but when it encounters speech, it will nearly always match to it. Unlike DTW, it works sequentially and doesn't have a global understanding of the entire audio, beyond the 30 second chunks it works on.
What you are describing is not only an alignment problem but also includes a kind of search element to it. You want to isolate ranges of text and audio that match each other. This toolset doesn't currently have something that is designed to do exactly that.
If you can isolate individual chapters as contiguous ranges of text and audio, then you can apply the alignment to those ranges individually.
To find the time offsets for different chapters, maybe it's possible to use other techniques, maybe even ones that don't use speech recognition at all.
We could synthesize the text, extract audio features for both the source and synthesized text, and then use a kind of scanning procedure to find the best matching offset for, say, a few seconds of the start and end of the synthesized segment for that part of the text.
The thing is that this search-oriented kind of audio feature matching also has to work independently of the rate of the speakers (real and synthesized), but because we are only matching prefixes, and not entire sequence ranges, DTW may not be the optimal approach.
It sounds more like something that would benefit from a kind of rate-independent acoustic fingerprinting. Basically extract a fingerprint of the first, say, 10 or 20 seconds of the synthesized audio, and then sequentially scan the source audio to match that fingerprint to its features (more optimized for multiple searches: use some sort of approximate nearest neighbors matching for the fingerprints into a pre-computed "database" derived from the source audio).
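As a toy illustration only (not a real fingerprinting scheme, and not rate-independent by itself), the "pre-computed database of source fingerprints" idea could look roughly like this, with the fingerprint here being just some fixed-length feature vector per window:

```ts
type Fingerprint = number[]; // e.g. averaged log band energies over a 10-20 second window

// Toy example: compare one query fingerprint (from the synthesized audio) against
// a pre-computed list of fingerprints taken at regular offsets in the source
// audio, and return the offsets of the k closest matches. A real system would use
// approximate nearest-neighbor search and a more robust, rate-tolerant fingerprint.
function findClosestOffsets(
  query: Fingerprint,
  database: { offsetSeconds: number; fingerprint: Fingerprint }[],
  k: number
): number[] {
  const scored = database.map(entry => {
    let distance = 0;
    for (let i = 0; i < query.length; i++) {
      distance += (query[i] - entry.fingerprint[i]) ** 2;
    }
    return { offsetSeconds: entry.offsetSeconds, distance };
  });

  scored.sort((a, b) => a.distance - b.distance);
  return scored.slice(0, k).map(entry => entry.offsetSeconds);
}
```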
I'm actually working on developing a form of acoustic fingerprinting method right now, but a non-typical one - not for audio or speech search (more like for audio compression, which is a very different thing).
Anyway, using acoustic fingerprints to match "landmark" locations in the synthesized audio into the source audio could be an alternative that is faster (possibly much faster) than using recognition. Then, applying the dtw alignment to the individual matched segments would be much, much faster than using recognition, so overall it would produce an efficient, ML-free approach.
Thank you so much for taking the time to think through this with me! Audio fingerprinting is a great idea for the search step, I'm going to start there and see if I can get that working.
I investigated a bit about acoustic fingerprints for speech and for varying rates. Seems like most fingerprinting methods are for music or sound effects, not specialized for speech, and are not rate independent.
During a conversation with Claude 3.5 Sonnet about this I had several ideas.
I suggested various approaches, like using "beam search" as an alternative to DTW (more like a hybrid of beam search and DTW-like matching). Later, it turned out that what I was going towards also resembles a variant of DTW called "subsequence DTW", which may be helpful for this kind of search task.
Basically, standard DTW always "anchors" the first frames of the two sequences to be matched together (represented by the top-left cell of the accumulated cost matrix), as well as the last frames of the two sequences to be matched together (the bottom-right cell of the accumulated cost matrix).
However, we don't necessarily have to. Since the first stage of DTW produces an accumulated cost matrix, its last column contains the costs of all partial matches between the target sequence (the audio "snippet" we are trying to match) and the candidate audio segment (which may be longer or shorter than the snippet).
So if we sequentially scan the audio to try to match overlapping candidate segments, and take a duration of, say, 2x or 3x the duration of the snippet we are searching for (to allow for slower speech to be matched), and compute the accumulated cost matrix between the snippet and each segment, then its last column would contain the costs of the best paths for all prefixes of the segment (the first frames are still anchored together, only the last ones aren't). The actual paths can be found by "backtracking" from the final cell to the start cell - the same way it's done in standard DTW (though usually only from the bottom-right cell).
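A minimal sketch of that idea (my reading of the description above, not existing echogarden code; the frame type and distance are the same simplifications as in the earlier alignment sketch):

```ts
type Frame = number[]; // MFCC vector, as in the earlier alignment sketch

function frameDistance(a: Frame, b: Frame): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] - b[i]) ** 2;
  return Math.sqrt(sum);
}

// Prefix-anchored DTW: only the start frames are anchored together (cell (0, 0)).
// Rows index the candidate segment, columns index the snippet. The returned last
// column holds, for every prefix of the segment, the cost of matching the whole
// snippet against that prefix.
function prefixMatchCosts(segment: Frame[], snippet: Frame[]): number[] {
  const rows = segment.length;
  const cols = snippet.length;
  const acc = Array.from({ length: rows }, () => new Array<number>(cols).fill(Infinity));

  for (let i = 0; i < rows; i++) {
    for (let j = 0; j < cols; j++) {
      const d = frameDistance(segment[i], snippet[j]);

      if (i === 0 && j === 0) {
        acc[i][j] = d;
      } else {
        acc[i][j] = d + Math.min(
          i > 0 ? acc[i - 1][j] : Infinity,
          j > 0 ? acc[i][j - 1] : Infinity,
          i > 0 && j > 0 ? acc[i - 1][j - 1] : Infinity
        );
      }
    }
  }

  // acc[i][cols - 1] = cost of matching the full snippet to segment[0..i]. The
  // best prefix is wherever this is minimal (typically after normalizing by
  // length), and its path can be recovered by backtracking from that cell.
  return acc.map(row => row[cols - 1]);
}
```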
I think this could become a general purpose audio search method, though at this point I don't exactly know how fast it's going to be. With low granularity for the MFCC features, and relatively short snippets, it may be fast enough to be practical.
Since I already have efficient code to compute the audio features and accumulative cost matrix, I could try to write an experimental audio search method reusing some of it. I'm currently interested in fingerprinting and matching in general, so this came at the right time.
For actual acoustic fingerprints, I had some other ideas, but they aren't going to be as accurate as this. They could maybe be used as "rough", initial screening methods if they have a low likelihood of false negatives (false positives are fine for initial screening).
It's been a long time since I thought hard about audio processing/analysis (though music information retrieval was my first area of expertise!), so I don't know how much help I'm going to be in the actual planning or execution of such an implementation, but... this sounds fantastic! I would be endlessly grateful to you for putting the time into this. Please don't hesitate to reach out if there's any way that I can help, even if that just means testing some experiment or being a rubber duck!
Hey @rotemdan! I wanted to check in on this; not to rush you, but just to see how it's going! It seems like a challenging problem haha. Do you still feel like you have a clear path forward?
I haven't gotten to this yet. I have other things I'm currently working on for the next release. For example, I'm working on adding machine translation support (Google Translate and the Facebook Research NLLB model) and hybrid speech-to-text translation that would add support for many target languages (recognize in native language -> translate text -> align timeline to translation). I'm not done with that yet.
The algorithm itself is pretty clear to me, though for practical implementation it requires dedicated code for the DTW variant it's using:
Say we have `n` frames in the snippet and `m` frames in the audio we are searching:
- First, compute the cost matrix, which is `O(m*n)`. That can be done in reasonable time (a similar time scale to existing DTW alignment operations).
- Then, for each subsequence of the searched audio of length `n * scale` (where `scale` is something like `2` or `3`), compute an accumulated cost matrix and take its rightmost column (most likely adding its content to a top-k priority queue). The accumulated matrix computation doesn't include any distance calculations, it's just walking through the cost matrix and selecting directions, so it should be fast in practice.
The overall number of iterations in the algorithm is high on paper: O(m * (n * (n * scale))) = O(m * n^2 * scale). That makes it hard for me to predict how fast it would run on very long audio and snippets. It is parallelizable, though that's not something I'm doing right now.
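Spelled out as code, the outer loop might look roughly like this (a sketch of the outline above, reusing the `Frame` type and `prefixMatchCosts` function from the sketch in my earlier comment; the hop size and the length normalization are arbitrary choices):

```ts
interface Candidate {
  segmentStart: number;   // frame offset into the searched audio
  segmentEnd: number;     // frame offset where the matched prefix ends
  normalizedCost: number; // accumulated cost divided by an approximate path length
}

// Sketch of the outline above: `snippet` has n frames, `audio` has m frames, and
// each candidate segment spans n * scale frames. Reuses prefixMatchCosts() from
// the earlier comment. The hop size and the length normalization are arbitrary.
function searchSnippet(
  snippet: Frame[],
  audio: Frame[],
  scale = 2,
  hop = 10,
  topK = 5
): Candidate[] {
  const segmentLength = snippet.length * scale;
  const candidates: Candidate[] = [];

  for (let start = 0; start + segmentLength <= audio.length; start += hop) {
    const segment = audio.slice(start, start + segmentLength);
    const lastColumn = prefixMatchCosts(segment, snippet);

    // Keep the best prefix end within this segment, normalized so that short and
    // long prefixes are comparable.
    let bestEnd = 0;
    let bestCost = Infinity;

    for (let end = 0; end < lastColumn.length; end++) {
      const normalized = lastColumn[end] / (end + 1 + snippet.length);
      if (normalized < bestCost) {
        bestCost = normalized;
        bestEnd = end;
      }
    }

    candidates.push({ segmentStart: start, segmentEnd: start + bestEnd, normalizedCost: bestCost });
  }

  // A real implementation would maintain a bounded top-k priority queue instead
  // of collecting and sorting all candidates.
  candidates.sort((a, b) => a.normalizedCost - b.normalizedCost);
  return candidates.slice(0, topK);
}
```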
In practice, it's possible this algorithm could produce results that are good enough by themselves, or for more accurate results, be used as a screening stage before applying a speech recognition engine on the top candidate audio ranges found. Using forced decoding (like I do in the whisper alignment engine) I could then extract probabilities from the recognizer to score each candidate (basically forcing Whisper to decode the tokens of the snippet and observing the probabilities it assigns to them).
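The rescoring stage could be shaped something like this (purely a sketch; `ForcedDecoder` is a hypothetical stand-in for whatever the recognizer would expose, not a real echogarden or Whisper API, and `Candidate` is the type from the search sketch above):

```ts
// Hypothetical stand-in (not a real echogarden or Whisper API): given a time
// range of the source audio and the snippet's token ids, return the
// log-probability the recognizer assigns to each token under forced decoding.
type ForcedDecoder = (
  startSeconds: number,
  endSeconds: number,
  tokenIds: number[]
) => Promise<number[]>;

// Rescore the top DTW search candidates by their mean forced-decoding
// log-probability. `Candidate` is the type from the search sketch above;
// frameDurationSeconds converts frame offsets to time (an assumed hop size).
async function rescoreCandidates(
  candidates: Candidate[],
  snippetTokenIds: number[],
  frameDurationSeconds: number,
  forcedDecode: ForcedDecoder
): Promise<{ candidate: Candidate; meanLogProb: number }[]> {
  const rescored: { candidate: Candidate; meanLogProb: number }[] = [];

  for (const candidate of candidates) {
    const start = candidate.segmentStart * frameDurationSeconds;
    const end = (candidate.segmentEnd + 1) * frameDurationSeconds;

    const logProbs = await forcedDecode(start, end, snippetTokenIds);
    const meanLogProb = logProbs.reduce((sum, p) => sum + p, 0) / logProbs.length;

    rescored.push({ candidate, meanLogProb });
  }

  // Higher mean log-probability means the recognizer "agrees" that the snippet's
  // text is actually spoken in that time range.
  return rescored.sort((a, b) => b.meanLogProb - a.meanLogProb);
}
```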
Getting all of this working and optimized would take some effort. Especially adding an extra recognition phase. I'm definitely interested to see how this would perform in practice, but I first must finish the other priorities.
Thanks for all of that explanation, I completely understand that you have higher precedence stuff to work on first, and that this is a big project!
I think my plan will be, in the meantime, to attempt to migrate over to echogarden, but continue using the current approach of text-based chapter search. It will still be a huge win to have support for languages other than English, and I suspect echogarden's alignment implementation is more performant than my current, homebrewed approach. This will make it a smaller step to move to audio-based search once it's available, too!
Thanks again for all your work and support here!
Hey @rotemdan! Hope all is well. Is this still something you plan on working on? I think it could be a huge deal for Storyteller! But I also genuinely understand if it's not at the top of your priorities. I just thought I would check in!
Actually, the published code does include a basic source file for search, but it's just a "stub"; it doesn't contain anything.
Over the past months I prioritized mostly fixing issues and did a lot of work on things I felt were more important: like a new audio playback library, a new text segmentation library (not integrated yet), breaking up some of the code into separate packages, and recently adding the Kokoro TTS engine (an independent port based directly on the published ONNX models).
I haven't really gone beyond what I described here in investigating how to implement the search. The cost of needing to go through each potential prefix may mean it would be slower than standard DTW, but I can't know for sure without testing it. Also, it's possible the matches it finds would be too imprecise, and further processing with a more expensive model would effectively be required (but I don't know to what degree). I can't know for sure. It's the kind of project that carries a risk that some effort may not produce a usable outcome. And, if the end result were too slow, I wouldn't really consider it a usable feature.
In general, I have some doubts whether simple cumulative distance metrics over basic MFCC audio features would be able to reliably find accurate audio matches. Maybe if the reference were a higher quality TTS engine, or a voice closer to the one being searched. Part of the reason DTW sort of works for alignment is that it performs a kind of "constraint satisfaction" search, where it tries to find a mapping between sequences that evaluates many "compromises" and is guided by a kind of global constraint - the entire sequence has to be mapped to the entire other sequence, which narrows down the best path. Once you try to do something similar with search, you don't really have the global constraint anymore, since the mapping is not constrained to a particular length. I can't really evaluate this "hunch" without testing, though.
In essence, I think it's a good future direction, I want to keep it open, and I would very likely work on it at some point (which could be months or years from now, though), and it may actually work, but this wasn't really one of my original goals in starting this project.
I would be less hesitant to work on, say, higher quality text-to-speech, real-time speech recognition etc. when the opportunity arises. I'm always trying to weigh the possible development cost / benefit of a particular feature, and, for example, when Kokoro TTS got published a few months ago I quickly realized the amount of work required to implement support for it, compared to the outcome quality, wouldn't be that high, given the tools I already had internally (especially the already developed internal grapheme-to-phone methods), and the results were mostly guaranteed. There wasn't really a chance the work would be wasted.
Thanks for the response! This makes total sense to me — I can empathize with the need to balance risky, unknown features against known improvements!
> Once you try to do something similar with search, you don't really have the global constraint anymore, since the mapping is not constrained to a particular length.
This seems like a potential issue to me, too!
Anyway, glad to hear it's something you're still interested in for the future, and no worries on my end if that's still a somewhat distant future. Storyteller is still benefitting greatly from all of the work you do on echogarden — thank you as always for this awesome library!