wav2letter
Forward long audio
Due to GPU memory limits, I have to split a long audio file into many chunks before running the forward pass. I call network->forward and receive a rawEmisson for each chunk. After running forward on all chunks, I concatenate the rawEmisson tensors and run the decoder with the LM.
My problem is that I get different results when I split the audio.
More details:
- I use a 10.27 s audio file. When I don't chunk it, I get a rawEmisson with dimensions (N, T) = (139, 520). But by my calculation, the number of timesteps should be T = 10.27 * 1000 / 10 / 2 = 513 --> it seems the code pads about 7 extra timesteps.
- To check this, I also split the audio into 2 parts, ran forward on each, then concatenated the rawEmisson, and got dimensions (N, T) = (139, 527). --> 527 - 513 = 14 timesteps --> each time I run network->forward, the code pads about 7 extra timesteps onto the rawEmisson.
I've debugged but have no idea where the padding happens or how to remove it so that splitting works correctly.
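Not from the original thread, but a minimal sketch of why per-chunk forwards can inflate T: each independent forward pass zero-pads its own chunk boundaries, so the padded regions are counted once per chunk instead of once per utterance. The kernel and stride below match the first conv layer of the network (8 along time, stride 2); the assumption of symmetric zero padding of kernel-1 = 7 frames per side is illustrative and may not match wav2letter's SAME mode exactly.

```python
def conv_out_len(t, kernel=8, stride=2, pad_per_side=7):
    """Output length of a 1-D convolution over the time axis.

    kernel/stride match the first (stride-2) layer of the posted
    architecture; pad_per_side is an assumed padding scheme, not
    necessarily what wav2letter's SAME mode computes.
    """
    return (t + 2 * pad_per_side - kernel) // stride + 1

# Forwarding the whole utterance pads its boundaries once...
whole = conv_out_len(1000)

# ...but forwarding two half-utterance chunks pads each chunk's
# boundaries independently, so the concatenated output is longer.
halves = conv_out_len(500) + conv_out_len(500)

print(whole, halves)
```

The usual workaround is to forward chunks with extra context frames borrowed from their neighbours and then trim the corresponding emission frames before concatenating, so every input frame contributes to exactly one retained output frame.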
Hi, can you also post your network architecture here so that we can verify the padding?
Here is my network architecture:
(0): View (-1 1 40 0)
(1): Conv2D (40->1024, 8x1, 2,1, SAME,SAME, 1, 1) (with bias)
(2): ReLU
(3): Conv2D (1024->1024, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
(4): ReLU
(5): Conv2D (1024->1024, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
(6): ReLU
(7): Conv2D (1024->1024, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
(8): ReLU
(9): Conv2D (1024->1024, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
(10): ReLU
(11): Conv2D (1024->1024, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
(12): ReLU
(13): Conv2D (1024->1024, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
(14): ReLU
(15): Conv2D (1024->1024, 8x1, 1,1, SAME,SAME, 1, 1) (with bias)
(16): ReLU
(17): Reorder (2,0,3,1)
(18): Linear (1024->1024) (with bias)
(19): ReLU
(20): Linear (1024->139) (with bias)
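For reference, a quick way to reason about how much overlap a chunked forward would need is the receptive field of this stack: how many input feature frames each emission frame depends on. This is a generic sketch (not from the thread), assuming the printout means eight time-convolutions of kernel width 8, the first with stride 2 and the rest with stride 1:

```python
# Receptive field of the conv stack, via the standard recurrence:
#   rf   += (kernel - 1) * jump   # frames added by this layer
#   jump *= stride                # input frames per output step so far
layers = [(8, 2)] + [(8, 1)] * 7  # (kernel, stride) per conv layer

rf, jump = 1, 1
for kernel, stride in layers:
    rf += (kernel - 1) * jump
    jump *= stride

print(rf)    # input frames seen by one output timestep
print(jump)  # input frames consumed per output timestep
```

If the receptive field is rf frames, a chunk boundary can only produce the same emissions as the whole-utterance forward when each chunk carries roughly rf // 2 frames of real context from its neighbour on each side (trimmed from the output afterwards), rather than zeros from padding.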