diart
quality concerns
It looks like the pipeline quickly forgets previous speakers and assigns their labels to new ones, so a conversation among 4-5 people is inferred as a conversation between 2.
I am testing alongside whisperx, which seems to use the same set of default models yet gives better results.
Before diving deeper into debugging, are there any obvious things I could be doing wrong? I tried a non-default embedding model with the same result.
@DmitriyG228 you can check out related issues such as #4, #133 and #226, where this was already discussed.