Atabak Pouya
The shape of TGRU's input (x9) is (Time, 16, 64). Since it should aggregate information along the time axis and batch_first=True in your implementation, the input of TGRU should have...
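A minimal PyTorch sketch of the shape issue described above. With `batch_first=True`, `nn.GRU` expects `(batch, seq, feature)`, so time has to sit on axis 1 for the recurrence to run over frames. Treating the axis of size 16 as the batch axis and 64 as the feature size is an assumption for illustration, as are `T = 100` and `hidden_size=64`; this is not necessarily the layout the repository or the paper uses.

```python
import torch
import torch.nn as nn

T = 100  # hypothetical number of time frames
x9 = torch.randn(T, 16, 64)  # (Time, 16, 64), as in the comment above

# hidden_size=64 is an illustrative choice, not taken from the paper
tgru = nn.GRU(input_size=64, hidden_size=64, batch_first=True)

# Assumption: the axis of size 16 acts as the batch axis.
# Move time to axis 1 so the GRU steps through frames.
x9_bt = x9.permute(1, 0, 2)  # (16, Time, 64)
out, h_n = tgru(x9_bt)       # out: (16, Time, 64), h_n: (1, 16, 64)

# Restore the original time-first layout for the following layers
out_time_first = out.permute(1, 0, 2)  # (Time, 16, 64)
```

Feeding `x9` without the permute would also run, since the shapes are compatible, but the GRU would then recur over the axis of size 16 rather than over time.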
@amirpashamobinitehrani The input shape for 1D conv is (T, C, F): time frames, channels (4 features), frequency bins.
Correct! Each frame is a data sample here. If you want to use (Batch, Time, Features, Frequency), you should use a 2D convolution and set the filter dimensions to (n, 1).
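To illustrate the two options in the comments above, here is a rough PyTorch sketch that checks a per-frame 1D convolution against a 2D convolution with an `(n, 1)` kernel. All sizes (`B`, `T`, `F`, `n`, 8 output channels) are hypothetical. It also assumes the 2D input is permuted to `(Batch, Channels, Frequency, Time)` so that the `(n, 1)` kernel spans `n` frequency bins and a single frame; with a `(Batch, Channels, Time, Frequency)` layout the kernel would be `(1, n)` instead.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, C, F, n = 2, 10, 4, 64, 5  # hypothetical sizes; C = 4 features

conv1d = nn.Conv1d(C, 8, kernel_size=n)
conv2d = nn.Conv2d(C, 8, kernel_size=(n, 1))

# Share weights so both layers compute the same filter:
# Conv1d weight (8, C, n) -> Conv2d weight (8, C, n, 1)
with torch.no_grad():
    conv2d.weight.copy_(conv1d.weight.unsqueeze(-1))
    conv2d.bias.copy_(conv1d.bias)

x = torch.randn(B, T, C, F)  # (Batch, Time, Features, Frequency)

# Option 1: 1D conv, each frame is its own sample: (B*T, C, F)
y1 = conv1d(x.reshape(B * T, C, F)).reshape(B, T, 8, F - n + 1)

# Option 2: 2D conv on (B, C, F, T); the (n, 1) kernel never mixes frames
y2 = conv2d(x.permute(0, 2, 3, 1))  # (B, 8, F-n+1, T)
y2 = y2.permute(0, 3, 1, 2)         # back to (B, T, 8, F-n+1)

same = torch.allclose(y1, y2, atol=1e-5)
```

Both paths apply the same frequency-only filter to every frame, so the choice is mostly about whether the batch and time axes are kept separate in the tensor layout.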
> Hi, > > I had the same question. Has anyone been able to successfully train this network? I think that as @atabakp mentioned, the input has to have shape...
> There are a few methods to do this, but I don't know exactly what the authors mean. For example: https://arxiv.org/pdf/1608.01953.pdf But for my training, I used Log Magnitude and...
> Thanks once again @atabakp! I was thinking something similar: > > 1. Use log magnitude (as in the paper) > 2. Use PCEN output (as in the paper) >...
Section 3 of this paper also has some information about phase demodulation: https://www.isca-speech.org/archive_v0/Interspeech_2018/pdfs/1773.pdf
> > I also have a question about the TGRU along the same lines. According to the paper: > > > The decoder is composed of a Time-axis Gated Recurrent...
> Hi @atabakp , > > Not sure if my interpretation of the outputs is correct, but I'm trying to follow the paper and even when the model trains, it...
> Hi again @atabakp , > > When training the model, are you using 2s audio as the paper claims or are you using gradient accumulation or something like that...