[w2v-bert] Questions about average duration by token

Open bofenghuang opened this issue 1 year ago • 0 comments

Hi @ylacombe,

Thank you for the new blog post about fine-tuning w2v-BERT.

However, I have some doubts about the "average duration seen by each token" figure, though I might be mistaken.

The feature extractor uses a hop_length of 160 and a reshape with a stride of 2. Therefore, for a 1-second signal with 16000 samples, it outputs nominally 16000 / 160 / 2 = 50 tokens (48 in practice). This means that each token sees about 1000 ms / 50 = 20 ms of signal.
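To make the arithmetic explicit, here is a minimal sketch of the nominal calculation. The helper names are hypothetical (they are not part of transformers), and the count ignores windowing and edge effects, which is presumably why the real extractor returns slightly fewer tokens.

```python
SAMPLE_RATE = 16_000  # samples per second
HOP_LENGTH = 160      # samples between consecutive frames
STRIDE = 2            # consecutive frames merged into one token by the reshape


def nominal_tokens(num_samples, hop_length=HOP_LENGTH, stride=STRIDE):
    """Nominal token count (hypothetical helper), ignoring edge effects."""
    return num_samples // hop_length // stride


def ms_per_token(hop_length=HOP_LENGTH, stride=STRIDE, sample_rate=SAMPLE_RATE):
    """Nominal duration of signal covered by one token, in milliseconds."""
    return 1000 * hop_length * stride / sample_rate


print(nominal_tokens(16_000))  # 50
print(ms_per_token())          # 20.0
```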

And if we follow the encoder with a single conv adapter with an adapter_stride of 2, the 50 tokens get subsampled to 25 tokens, which means that each token now sees 40 ms of signal.
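The effect of the adapter can be sketched the same way. This is only a nominal estimate under the assumption that each adapter layer divides the sequence length by its stride; after_adapter and its num_layers argument are hypothetical names for illustration, not library functions.

```python
def after_adapter(num_tokens, ms_per_token, adapter_stride=2, num_layers=1):
    """Nominal token count and per-token duration after conv adapter layers.

    Assumes each layer subsamples the sequence by `adapter_stride`
    and ignores padding and edge effects.
    """
    factor = adapter_stride ** num_layers
    return num_tokens // factor, ms_per_token * factor


# 50 tokens of 20 ms each, one adapter layer with stride 2
print(after_adapter(50, 20.0))  # (25, 40.0)
```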

bofenghuang, Jan 27 '24 18:01