dress-code
Warper and HPE training hyperparams
I noticed that the learning rates for both the warper and the HPE were fixed during training (100k/50k iterations) according to the paper. With these parameters, I see the loss stagnating after a few thousand iterations.
Have you tried LR decay or different hyperparameters? Also, why did you use such a high number of iterations? Did you observe a steady decrease in loss over the full 50k iterations?
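For reference, the kind of schedule I have in mind is a simple step decay of the learning rate. This is just a generic sketch; the decay factor and interval below are hypothetical examples, not values from the paper:

```python
def decayed_lr(base_lr: float, step: int,
               decay_factor: float = 0.5,
               decay_every: int = 10_000) -> float:
    """Step-wise LR decay: multiply the base LR by `decay_factor`
    every `decay_every` iterations.

    Hypothetical values for illustration only, not the authors' setup.
    """
    return base_lr * decay_factor ** (step // decay_every)

# With base_lr=1e-4: LR stays at 1e-4 for steps 0-9999,
# then drops to 5e-5, 2.5e-5, ... every 10k steps.
print(decayed_lr(1e-4, 0))
print(decayed_lr(1e-4, 15_000))
```

The same effect could be achieved with a standard scheduler (e.g. PyTorch's `torch.optim.lr_scheduler.StepLR`) instead of a fixed LR for the full 100k/50k iterations.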