Minjie Xu
Depends on #4. I found #4 to be more or less sufficient to get aspect 0 training stably (even with batch size 1024), but not so much for aspects 1 and 2...
1. only update lambda **during training**
2. use **separate** moving averages for `train` vs. `eval` (similar to batch-norm, I guess?)

I find 1 to be **crucial** for stabilizing the training under...
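The two tricks above could be sketched roughly as follows (a minimal sketch with hypothetical names, not this repo's actual API): the multiplier is only stepped in train mode, and the constraint's moving average is tracked per mode, analogous to batch-norm's running statistics.

```python
class ConstraintLambda:
    """Sketch: lambda updated only during training, with separate
    per-mode exponential moving averages of the constraint value."""

    def __init__(self, lr=0.1, momentum=0.9):
        self.lam = 0.0                      # Lagrange multiplier, kept >= 0
        self.lr = lr
        self.momentum = momentum
        self.ema = {"train": 0.0, "eval": 0.0}

    def observe(self, constraint_value, training):
        mode = "train" if training else "eval"
        # mode-specific moving average, so eval-time statistics
        # never leak into the multiplier update (trick 2)
        self.ema[mode] = (self.momentum * self.ema[mode]
                          + (1 - self.momentum) * constraint_value)
        if training:
            # gradient *ascent* on lambda, only in train mode (trick 1),
            # clamped at 0 so the multiplier stays non-negative
            self.lam = max(0.0, self.lam + self.lr * self.ema["train"])
        return self.lam
```

With this, calling `observe(..., training=False)` changes only the eval-side statistics and leaves lambda untouched, which is the point of trick 1.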
Per the title: since the augmented Lagrangian is a **minimax** problem (i.e. min w.r.t. model parameters, yet max w.r.t. lambdas), it doesn't really make sense to always prefer the lower...
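A toy illustration of that minimax structure (not this repo's code, just an assumed textbook setup): minimize x^2 subject to x >= 1 via the augmented Lagrangian L(x, lam) = x^2 + lam*(1 - x) + rho/2*(1 - x)^2, descending on x while *ascending* on lam. Since lam keeps growing while the constraint is violated, the raw loss value shifts over time and is not comparable across checkpoints.

```python
rho, lam, x = 10.0, 0.0, 0.0
for _ in range(4000):
    g = 1.0 - x                          # constraint violation (want g <= 0)
    grad_x = 2.0 * x - lam - rho * g     # dL/dx
    x -= 0.01 * grad_x                   # min step w.r.t. x
    lam = max(0.0, lam + 0.05 * g)       # max step w.r.t. lam (ascent)
# settles at the constrained optimum x = 1, with lam = 2
```

At the fixed point, stationarity in x gives 2x - lam - rho*(1 - x) = 0 with x = 1, hence lam = 2, even though the unconstrained loss x^2 alone would prefer smaller x.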