VGGVox
about the pool_time layer
I'm confused about the pool_time layer. As stated, 'modify the layers to adapt to the spectrogram'. With an input of size 512×300 for a 3 s segment, ResNet-50 outputs 9×8×2048, which is followed by a 9×1×2048 fully connected layer. How does the 1×N average pooling layer work? The 9×1×2048 fc layer seems to have nothing to do with N. It could be followed directly by the fc2 (5994) layer to produce the output... Please help.
The 9×1×2048 fc1 layer outputs a feature map of size 1×8×2048 when fed an input of size 9×8×2048, so N = 8 in this case.
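To make the shape arithmetic concrete, here is a minimal sketch. It assumes fc1 behaves like an unpadded, stride-1 convolution with a 9×1 kernel that slides along the time axis, which is what the sizes in the reply imply; the helper names and the longer 20-frame input are hypothetical and only for illustration.

```python
def valid_conv_shape(in_shape, kernel, out_channels):
    """Output shape (freq, time, channels) of an unpadded, stride-1 convolution."""
    freq, time, _ = in_shape
    k_freq, k_time = kernel
    return (freq - k_freq + 1, time - k_time + 1, out_channels)


def pool_time_shape(in_shape):
    """Output shape of a 1 x N average pool, where N is the full time width."""
    freq, _, channels = in_shape
    return (freq, 1, channels)


# Sizes taken from the thread: ResNet-50 output for a 3 s segment.
resnet_out = (9, 8, 2048)

# fc1 as a 9x1 kernel collapses the frequency axis but keeps the time axis.
fc1_out = valid_conv_shape(resnet_out, kernel=(9, 1), out_channels=2048)

# The 1 x N average pool then collapses the remaining N time steps.
pooled = pool_time_shape(fc1_out)

# Hypothetical longer input: the time width (and so N) grows with it,
# but the pooled size stays fixed, so fc2 always sees the same shape.
longer_fc1_out = valid_conv_shape((9, 20, 2048), kernel=(9, 1), out_channels=2048)
longer_pooled = pool_time_shape(longer_fc1_out)

print(fc1_out, pooled)
print(longer_fc1_out, longer_pooled)
```

Under these assumptions, N is simply the time width left after fc1, so the pooling layer is what lets fc2 receive a fixed-size vector regardless of segment length.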