MeZO
Impact of Dropout?
Hello and thank you for sharing this interesting approach.
I have one question regarding dropout. If I understand the published code correctly, MeZO was tested with dropout deactivated, e.g. lines like here use model.eval(), which disables dropout. However, I did not find dropout being disabled for the baselines it is compared against.
Is it correct that in the experiments you were comparing FT with dropout against MeZO without dropout? I am asking specifically about the results shown in Table 1, Table 3, and Table 16.
Would it make sense to also provide a baseline with dropout disabled there?
Also, I wonder whether MeZO could be used with dropout activated. I suppose this would be possible provided that both forward passes use the same dropout masks. This condition could be met by using the seed s not only to generate z but also to seed the two forward passes that follow.
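To make the idea concrete, here is a minimal sketch in plain Python. It is not the repository's code: it uses a toy linear model with dropout on the inputs instead of a language model, it returns new parameter lists rather than perturbing in place, and the names `forward_loss`, `mezo_step`, and `mask_seed` are hypothetical. The only point is that z and the dropout mask are both derived from the seed s, so the two forward passes see identical masks.

```python
import random


def forward_loss(theta, x, y, drop_p, mask_seed):
    """Squared error of a toy linear model with inverted dropout on the inputs.

    The dropout mask is fully determined by mask_seed, so two calls with the
    same mask_seed drop exactly the same inputs.
    """
    rng = random.Random(mask_seed)
    keep = 1.0 - drop_p
    pred = 0.0
    for w, xi in zip(theta, x):
        m = 1.0 if rng.random() < keep else 0.0  # Bernoulli keep mask
        pred += w * xi * m / keep
    return (pred - y) ** 2


def mezo_step(theta, x, y, seed, eps=1e-3, lr=1e-2, drop_p=0.1):
    """One zeroth-order step with a shared dropout mask (illustrative only)."""

    def perturbed(scale):
        # Regenerate the same z from the seed instead of storing it.
        rng = random.Random(seed)
        return [w + scale * eps * rng.gauss(0.0, 1.0) for w in theta]

    # Assumption: the mask seed is derived from s; any fixed rule would do,
    # as long as both forward passes use the same value.
    mask_seed = seed + 1

    loss_plus = forward_loss(perturbed(+1.0), x, y, drop_p, mask_seed)
    loss_minus = forward_loss(perturbed(-1.0), x, y, drop_p, mask_seed)
    projected_grad = (loss_plus - loss_minus) / (2 * eps)

    # Update along z, again regenerated from the seed.
    rng = random.Random(seed)
    new_theta = [w - lr * projected_grad * rng.gauss(0.0, 1.0) for w in theta]
    return new_theta, projected_grad
```

With the mask fixed, both losses are evaluated on the same sub-network, so their difference reflects only the perturbation along z and not a change in the dropout noise.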
What do you think of this idea?
Hi,
Your understanding is correct. Dropout is deactivated for MeZO but activated for fine-tuning. For fine-tuning, dropout usually leads to better performance. For MeZO, however, omitting dropout is theoretically motivated, and we did not empirically explore a dropout version of MeZO.