monney
Hi! As of right now, the only other difference that I think might be important is that the BatchNorm momentum should be set to `0.01`, not `0.001`, for `0.99` momentum...
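For reference, PyTorch's `momentum` is the weight on the *new* batch statistics, i.e. `1 - momentum` in the TensorFlow convention, so TF's `0.99` maps to `0.01` here (a minimal sketch; the channel count is arbitrary):

```python
import torch.nn as nn

# TF:      running = 0.99 * running + 0.01 * batch        (momentum=0.99)
# PyTorch: running = (1 - m) * running + m * batch, so m = 0.01
bn = nn.BatchNorm2d(128, momentum=0.01)
```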
@as2626 please see the reference implementation in the Google Research repo. They do the same thing. In their open implementation they drop the cosine similarity for the raw dot product. Then...
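To make the distinction concrete, a toy sketch (the `g_a`/`g_b` names are placeholders for the two flattened gradient vectors being compared, not variables from either repo):

```python
import torch
import torch.nn.functional as F

# Stand-ins for the two flattened gradient vectors (purely illustrative)
g_a = torch.randn(1000)
g_b = torch.randn(1000)

cos = F.cosine_similarity(g_a, g_b, dim=0)  # normalized, as the paper describes
dot = torch.dot(g_a, g_b)                   # raw dot product, as in the open implementation
```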
@as2626 See the author's note here: https://github.com/google-research/google-research/issues/534 And the Taylor expansion here: https://mathworld.wolfram.com/TaylorSeries.html Plugging in values and rearranging terms will get you the first-order approximation.
@zwd973 I can't fully follow your derivation. But this is the formula used in the original code, as stated above. I believe it is correct as is. Here's the derivation:...
@zwd973 You're right, I missed a negative. Interesting. The original author's code is wrong here, then.
@kekmodel this might be why it got worse? Though I'm not sure how the author was able to replicate the results.
@kekmodel that's unfortunate to hear. But thank you for all your work thus far. The number of discrepancies in the original code makes things quite difficult.
> If I am not mistaken, then the first order Taylor expansion goes as
> `f(x) = f(a) + f'(a)(x-a)`.
> So there is f'(a) instead of f'(x). Then with...
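Spelling that argument out in the MPL setting (my notation, reconstructed from the discussion rather than quoted from any comment): let $CE_l$ be the student's loss on labeled data, $CE_u$ its loss on the pseudo-labeled batch, and $\eta$ the student learning rate. The student step is $\theta_{new} = \theta_{old} - \eta \nabla CE_u(\theta_{old})$; expanding $CE_l$ to first order around $a = \theta_{old}$ gives

$$CE_l(\theta_{new}) \approx CE_l(\theta_{old}) - \eta\, \nabla CE_l(\theta_{old})^\top \nabla CE_u(\theta_{old})$$

so the gradient dot product is approximately $(CE_l(\theta_{old}) - CE_l(\theta_{new}))/\eta$, i.e. proportional to old − new.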
I think that the correct formula is old − new, based on the several derivations that have been done here. But I don't think the MPL loss really has an effect either...
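Concretely, a minimal sketch of what the fix would look like (the values and the `s_loss_l_*` names are illustrative, not this repo's exact code):

```python
import torch

# Student's labeled-data loss before and after the pseudo-label update
# (illustrative values; in the training loop these come from two forward passes)
s_loss_l_old = torch.tensor(0.92)
s_loss_l_new = torch.tensor(0.87)

# Per the first-order argument above, the teacher's feedback coefficient
# should be proportional to (old - new), not (new - old):
dot_product = s_loss_l_old - s_loss_l_new  # > 0 when the pseudo-label step helped
```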
> @monney Thank you for your valuable insights. I have some follow-up questions regarding your experiments.
>
> 1. When you say "it beat UDA alone", do you mean "MPL+UDA+large...