diart

Why are the sliding windows so tight in the diarization process?

Open ywangwxd opened this issue 11 months ago • 3 comments

Luckily, I have successfully integrated faster-whisper into the diart-spk branch. I may submit a PR later.

But I have a question about the sliding windows in diarization. I used the default step parameter and set the duration to 5s for both diarization and ASR. I found that the sliding-window features passed into the __call__ function of the SpeakerAwareTranscription pipeline are very dense. They look like this:

Segment(0, 5)
Segment(0.5, 5.5)
Segment(1, 6)
Segment(1.5, 6.5)

There is too much overlap between consecutive windows. Even if I set the batch size to 32 for the diarization process, the effective audio length for ASR is only 31*0.5+5=20.5s. This also makes the diarization process much less efficient, since there is too much redundant computation between windows. Do I understand the underlying logic correctly? Should I assign a larger value to the step parameter?
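To make the arithmetic above concrete, here is a minimal sketch in plain Python (it does not use diart; the helper names `sliding_windows` and `covered_span` are made up for illustration). It assumes a 0.5s step, which is what the Segment list above suggests, and reproduces both the window boundaries and the 20.5s figure for a batch of 32 windows.

```python
def sliding_windows(duration, step, count):
    """Return `count` (start, end) windows of length `duration`, each shifted by `step` seconds."""
    return [(i * step, i * step + duration) for i in range(count)]


def covered_span(duration, step, count):
    """Total audio in seconds spanned by `count` overlapping windows."""
    return (count - 1) * step + duration


# Assumed values: duration=5.0 and step=0.5, as seen in the Segment list above.
windows = sliding_windows(duration=5.0, step=0.5, count=32)
print(windows[:4])                  # [(0.0, 5.0), (0.5, 5.5), (1.0, 6.0), (1.5, 6.5)]
print(covered_span(5.0, 0.5, 32))   # 20.5
```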

ywangwxd avatar Dec 20 '24 01:12 ywangwxd

Given these doubts, I tried setting step=4.5 and duration=5.0, i.e., an overlap of 0.5 seconds. I have not found the SpeakerAwareDiarization results getting worse, and it is much faster.
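As a rough way to see where the speed-up comes from, the sketch below counts how many windows are needed to cover a clip and how many seconds of audio the model ends up processing in total. This is plain Python with hypothetical helper names (`window_count`, `processed_seconds`), not diart code, and the 60s clip length is an arbitrary example. It only measures redundant computation and says nothing about diarization accuracy.

```python
import math


def window_count(audio_seconds, duration, step):
    """Number of full windows of length `duration` that fit when sliding by `step`."""
    if audio_seconds < duration:
        return 0
    return math.floor((audio_seconds - duration) / step) + 1


def processed_seconds(audio_seconds, duration, step):
    """Total seconds of audio fed to the model, counting overlapping regions repeatedly."""
    return window_count(audio_seconds, duration, step) * duration


# Hypothetical 60s clip, 5s windows, compared at the two step sizes discussed above.
for step in (0.5, 4.5):
    n = window_count(60.0, 5.0, step)
    total = processed_seconds(60.0, 5.0, step)
    print(f"step={step}: {n} windows, {total}s processed")
```

With these assumed numbers, the 0.5s step processes each second of audio roughly ten times, while the 4.5s step processes it only slightly more than once.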

ywangwxd avatar Dec 20 '24 09:12 ywangwxd

Hi @ywangwxd, this kind of sliding window is made for the diarization pipeline, but if I remember correctly, in my blog post about combining Whisper and diart I used non-overlapping 2s windows for this, so the window had to be readjusted down the line. Otherwise you get duplicate captions.
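One possible way to picture "readjusting the window" is sketched below. This is an assumption for illustration only, not the approach from the blog post or anything in diart, and `new_audio_regions` is a hypothetical helper. It takes overlapping windows and keeps, for each one, only the part that no earlier window has covered, so a transcriber would see every stretch of audio exactly once.

```python
def new_audio_regions(windows):
    """Trim each (start, end) window to the part not covered by earlier windows.

    Assumes windows are sorted by start time.
    """
    regions = []
    covered_until = None
    for start, end in windows:
        # The first window is kept whole; later ones start where coverage ended.
        region_start = start if covered_until is None else max(start, covered_until)
        if end > region_start:
            regions.append((region_start, end))
            covered_until = end
    return regions


overlapping = [(0.0, 5.0), (0.5, 5.5), (1.0, 6.0), (1.5, 6.5)]
print(new_audio_regions(overlapping))
# [(0.0, 5.0), (5.0, 5.5), (5.5, 6.0), (6.0, 6.5)]
```

If the full overlapping windows were sent to ASR instead, the region from 1.5s to 5.0s would be transcribed four times, which is the kind of duplication that produces repeated captions.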

juanmc2005 avatar Dec 21 '24 16:12 juanmc2005

So you mean there is no need for any overlap between consecutive sliding windows for diarization at all? I know there is no need for overlap in ASR. But as you mentioned above, diarization originally needs some overlap. Then why is it no longer needed when combining diarization and ASR? I found that the duration and step parameters are completely independent between diarization and ASR. For ASR, I do not need to specify the step parameter because it is set internally to the same value as the ASR duration.

ywangwxd avatar Dec 23 '24 03:12 ywangwxd