PyTorch_Speaker_Verification

What is the duration of audio covered by each d-vector embedding that is created?


Hi,

Thanks for this work. I am using the output of dvector_create.py as input to uis-rnn, and diarization runs fine.

But I am confused about the number of d-vector embeddings created. dvector_create.py produced 24 embeddings for a 9.7-second audio file and 21 embeddings for an 8.9-second one. In the first case, if I assume each embedding corresponds to 240 milliseconds of audio and add them up, I do not get the full audio duration: 24 * 240 ms = 5760 ms ≈ 5.76 seconds, but the audio file is 9.7 seconds long.
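For reference, here is how I am reasoning about the count. This is purely my own assumption that the embeddings come from overlapping sliding windows; the window length, hop, and any VAD trimming of silence in dvector_create.py could all make the real count differ:

```python
# Assumption: one embedding per sliding window of length win_s, advanced
# by hop_s. These values are illustrative, not taken from dvector_create.py.
import math

def num_windows(duration_s, win_s=0.24, hop_s=0.12):
    """Rough count of sliding windows that fit in duration_s seconds."""
    if duration_s < win_s:
        return 0
    return int(math.floor((duration_s - win_s) / hop_s)) + 1

print(num_windows(9.7))  # will not match the 24 I observe unless the real
                         # win_s/hop_s (and VAD trimming) are plugged in
```

If something like this is what the script does, then simply multiplying the embedding count by the window length would not be expected to equal the audio duration.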

I want to understand this because I need to split the audio after diarization. The idea is: if the diarization result says the first 10 embeddings belong to speaker 1, and I know each embedding covers X ms, then speaker 1's segment lasts 10 * X ms (10X/1000 seconds), so I can cut the audio at that point and continue the same way for the remaining speakers, as in the sketch below. Without knowing which time frame (in milliseconds) each embedding corresponds to, I cannot tell from what time to what time speaker 1 spoke, or the time frames for speaker 2, so I cannot split the audio.
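This is roughly what I have in mind. The window/hop values and the use of pydub are my own placeholder choices, not something taken from this repository; I would need the real window length and hop used by dvector_create.py to make it correct:

```python
# Sketch: cut out the audio covered by a run of embedding indices.
# win_s and hop_s are assumptions, not the values used by dvector_create.py.
from pydub import AudioSegment  # illustrative choice of audio library

def embedding_span(i, win_s=0.24, hop_s=0.12):
    """Start/end time (seconds) of the i-th sliding-window embedding."""
    start = i * hop_s
    return start, start + win_s

def cut_speaker(wav_path, first_idx, last_idx, out_path,
                win_s=0.24, hop_s=0.12):
    """Export the audio covered by embeddings first_idx..last_idx."""
    audio = AudioSegment.from_wav(wav_path)
    start_s, _ = embedding_span(first_idx, win_s, hop_s)
    _, end_s = embedding_span(last_idx, win_s, hop_s)
    audio[int(start_s * 1000):int(end_s * 1000)].export(out_path, format="wav")

# e.g. if diarization says embeddings 0..9 belong to speaker 1:
# cut_speaker("meeting.wav", 0, 9, "speaker1_part1.wav")
```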

Please help me understand this. Also, is there another way you would suggest to split the audio?

abhilashnayak · Dec 13 '19