audioContextEncoder
Use dropout correctly
I added a dropout feature to the sequential model. Preliminary tests on it are a bit hard to assess.
I trained two equivalent networks for 800k steps with a learning rate of 1e-3. In orange is a network with dropout = 0.3 for the linear layer and 0.1 for all conv and deconv layers except the last deconv. In blue is the same network without any dropout. I think the sudden change in the training SNR of the orange one comes from when I restarted the training with dropout = 0.3 for the linear layer (before it was 0.5; I'm not really sure).
It seems to work well, since the performance on the validation set is better with dropout and the performance on the training set is worse.
What do you think? Should I run more tests? Are these parameters good for you? (30% on the linear layer and 10% on the convs)
I also tried the same net with only dropout = 50% on the convs (blue):
I will also change the implementation of the dropout to be a little more explicit and descriptive.
I changed it here: https://github.com/andimarafioti/audioContextEncoder/commit/a8208b776af7dd95a18fa77333bf8a14b9d5113f
According to the original paper, dropout should be applied after the activation (ReLU).
And here F. Chollet says the ReLU should go before batch normalization.
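Putting those two suggestions together, a block would go linear/conv → ReLU → batch norm → dropout. A minimal NumPy sketch of that ordering, with hypothetical shapes and names (this is an illustration of the ordering, not the repo's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    # Normalize over the batch dimension (running inference stats omitted).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def dropout(x, rate, training):
    # Inverted dropout: kept units are scaled up so eval needs no rescaling.
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def block(x, w, rate=0.1, training=True):
    # Suggested ordering: linear/conv -> ReLU -> batch norm -> dropout.
    return dropout(batch_norm(relu(x @ w)), rate, training)

x = rng.standard_normal((8, 16))   # hypothetical batch of 8, 16 features
w = rng.standard_normal((16, 32))
out = block(x, w)
```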
When I change the learning rate, the validation SNR improves drastically for the network without dropout, making it work much better than the one with dropout, and behaving similarly to how it does on the training set:
I don't know why this effect happens with the learning rate, but it's been happening for a while now. The weirdness is: in blue I added dropout and it did worse; in orange I removed dropout and it did better. Of course, dropout is (or should be) removed at testing/validation time.
Maybe this small network is not able to overfit the training set?
I may be seeing a problem that arises from not having separate graphs for training and evaluation.
I did some tests setting the dropout to really high values; the performance on the testing set is really affected but not on the validation set, so it's probably not a matter of having separate graphs.
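One quick sanity check for the single-graph concern, assuming the usual inverted-dropout implementation (names here are hypothetical): with the training flag off, the dropout layer must be an exact identity regardless of the rate, so even an extreme rate cannot change validation results if the flag is wired correctly.

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(x, rate, training):
    # Inverted dropout: at eval time this must be an exact identity,
    # so even rate = 0.9 cannot affect validation if the flag is correct.
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

x = rng.standard_normal((4, 8))
eval_out = dropout(x, 0.9, training=False)    # identity at eval time
train_out = dropout(x, 0.9, training=True)    # most units zeroed in training
```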
According to the plot, it seems to work. To be discussed in the next meeting.
Please use 20% dropout before the fully connected layer only.
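For reference, that configuration could look like this sketch (pure NumPy, hypothetical shapes; 20% dropout applied only to the input of the fully connected layer, nothing on the convs):

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout(x, rate, training):
    # Inverted dropout; identity when not training.
    if not training:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def head(features, w_fc, training):
    # 20% dropout only right before the fully connected layer.
    dropped = dropout(features, rate=0.2, training=training)
    return dropped @ w_fc

features = rng.standard_normal((8, 128))  # hypothetical flattened conv features
w_fc = rng.standard_normal((128, 10))
train_out = head(features, w_fc, training=True)
eval_out = head(features, w_fc, training=False)
```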