Minjie Xu

3 issues by Minjie Xu

Depends on #4
I found #4 to be kinda sufficient to get aspect 0 to train stably (even with batch size 1024), but not so much for aspects 1 and 2....

1. only update lambda **during training**
2. use **separate** moving averages for `train` vs. `eval` (similar to batch-norm, I guess?)

I find 1 to be **crucial** for stabilizing the training under...
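A minimal PyTorch-style sketch of the two points as I understand them (the class name `LagrangeMultiplier`, the dual-ascent update on the moving average, and all hyperparameters are illustrative assumptions, not taken from the issue):

```python
import torch
import torch.nn as nn

class LagrangeMultiplier(nn.Module):
    """Sketch: lambda is updated only in training mode, and separate
    moving averages of the constraint value are kept for train vs. eval."""

    def __init__(self, lam_lr: float = 1e-2, momentum: float = 0.99):
        super().__init__()
        self.lam_lr = lam_lr
        self.momentum = momentum
        self.register_buffer("lam", torch.zeros(()))       # the multiplier
        self.register_buffer("train_ma", torch.zeros(()))  # MA used in train
        self.register_buffer("eval_ma", torch.zeros(()))   # MA used in eval

    def forward(self, violation: torch.Tensor) -> torch.Tensor:
        v = violation.detach()
        # Point 2: separate moving averages for train vs. eval, much as
        # batch-norm keeps distinct statistics for the two modes.
        ma = self.train_ma if self.training else self.eval_ma
        ma.mul_(self.momentum).add_((1 - self.momentum) * v)
        if self.training:
            # Point 1: dual ascent on lambda happens only during training,
            # so evaluation passes never move the multiplier.
            self.lam.add_(self.lam_lr * ma).clamp_(min=0.0)
        # Contribution of the constraint to the (primal) loss.
        return self.lam * violation
```

With this split, evaluation passes neither move `lam` nor contaminate the training-time statistics, which is presumably what makes point 1 stabilizing.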

Per title: as the augmented Lagrangian is a **minimax** problem (i.e., min w.r.t. the model parameters but max w.r.t. the lambdas), it doesn't really make sense to always prefer the lower...
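For concreteness, the generic augmented-Lagrangian objective being referenced, written here for a single constraint $c(\theta) \le 0$ with penalty weight $\rho$ (the exact objective used in the project isn't visible in this snippet):

```latex
\min_{\theta} \; \max_{\lambda \ge 0} \;\; f(\theta) + \lambda\, c(\theta) + \frac{\rho}{2}\, c(\theta)^{2}
```

Because $\lambda$ is being maximized, a smaller objective value can simply mean the multiplier hasn't yet caught up to the current constraint violation, so the raw objective isn't directly comparable across checkpoints.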