diffae
Why use zero_module?
Thanks for your code for the project! It is really nice work!
I am confused about the use of zero_module: since it initializes a module's weights to zero, it seems it could lead to zero gradients between the input and the output. Is it still possible to correctly train the model parameters with the expected gradients?
Is it true that zero_module causes zero gradients? I'm not sure about this.
By the way, we used zero_module following a previous work, but on its own it also has a positive effect, namely faster learning (as shown in the previous works).
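For what it's worth, the gradient worry can be checked directly. Below is a minimal sketch: the `zero_module` helper follows the guided-diffusion codebase that this repo's UNet builds on, while the residual-block wiring (`x + zero_conv(conv(x))`) is an illustrative assumption. The weight gradient of a zero-initialized layer depends on its *input* activations, not on its (zero) weight values, so it is generally nonzero and the layer trains normally:

```python
import torch
import torch.nn as nn

def zero_module(module):
    # Zero out the parameters of a module (as in guided-diffusion).
    for p in module.parameters():
        p.detach().zero_()
    return module

# Hypothetical residual block: out = x + zero_conv(conv(x))
conv = nn.Conv2d(4, 4, 3, padding=1)
zero_conv = zero_module(nn.Conv2d(4, 4, 3, padding=1))

x = torch.randn(2, 4, 8, 8, requires_grad=True)
out = x + zero_conv(conv(x))
out.sum().backward()

# The zeroed conv's weight gradient is NOT zero: it is the product of
# the layer's input activations and the upstream gradient.
print(zero_conv.weight.grad.abs().sum() > 0)

# The skip connection also carries a nonzero gradient back to x, even
# though the zeroed branch contributes nothing at initialization.
print(x.grad.abs().sum() > 0)
```

So at initialization the residual branch outputs zero (the block starts as an identity), but its parameters still receive useful gradients from the first step onward.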
Thank you for your feedback! Could you please provide the paper title or GitHub link for the "previous works"?