[w2v-bert] Questions about average duration by token
Hi @ylacombe,
Thank you for the new blog post about fine-tuning w2v-BERT.
However, I have some doubts about the "average duration seen by each token", though I may be mistaken.
The feature extractor employs a hop_length of 160 and a reshape with a stride of 2. Therefore, for a 1-second signal with 16000 samples, it outputs 16000 / 160 / 2 = 50 (actually 48) tokens. This means that each token sees 1000 ms / 50 = 20 ms of signal.
And if we follow the encoder with a single conv adapter with an adapter_stride of 2, the 50 tokens get subsampled to 25 tokens, which means that each token now sees 40 ms of signal.
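To make the arithmetic above easier to check, here is a minimal sketch in plain Python. The helper name `ms_per_token` and its parameter names are my own, chosen for illustration, and are not taken from the library. It only multiplies the strides together, so it ignores windowing and edge effects, which is presumably why the real output is 48 tokens rather than 50.

```python
def ms_per_token(sample_rate=16000, hop_length=160, stride=2,
                 adapter_stride=2, num_adapter_layers=0):
    """Approximate duration (in ms) covered by one output token.

    Hypothetical helper: assumes each stage simply subsamples by its stride
    and ignores windowing/padding at the edges of the signal.
    """
    # Number of raw audio samples between two consecutive output tokens.
    samples_per_token = hop_length * stride * adapter_stride ** num_adapter_layers
    return 1000 * samples_per_token / sample_rate


def tokens_per_second(**kwargs):
    """Approximate number of output tokens for one second of audio."""
    return 1000 / ms_per_token(**kwargs)


# Feature extractor only: hop_length 160, then reshape with stride 2.
print(ms_per_token(), tokens_per_second())

# Same, followed by one conv adapter layer with adapter_stride 2.
print(ms_per_token(num_adapter_layers=1),
      tokens_per_second(num_adapter_layers=1))
```

Under these assumptions, the first call gives 20 ms per token (50 tokens per second) and the second gives 40 ms per token (25 tokens per second), matching the numbers above.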