Is requires_grad mandatory?
Does every tensor used in TE need to have `requires_grad=True`?
I needed to add a dummy tensor for compatibility purposes to get activation checkpointing working with TE in the Megatron-DeepSpeed PR here. I had to set `requires_grad=True` on it for TE to work, and I'm wondering whether that is always the case.
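For context, a minimal sketch of the dummy-tensor pattern outside of TE/Megatron (the `run` function and `dummy` name here are illustrative, not the actual PR code): with reentrant activation checkpointing, if none of the inputs to the checkpointed function require grad, the output does not require grad either and backward through the checkpointed region breaks, even though the layer inside has trainable parameters. Passing an extra tensor with `requires_grad=True` works around this.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A block whose learnable state lives inside the module, not in its inputs.
layer = torch.nn.Linear(8, 8)

def run(x, dummy):
    # `dummy` is only here so checkpoint() sees an input with
    # requires_grad=True; adding zeros does not change the output value.
    return layer(x) + dummy

x = torch.randn(2, 8)                       # activation: requires_grad=False
dummy = torch.zeros(1, requires_grad=True)  # hypothetical dummy tensor (the workaround)
out = checkpoint(run, x, dummy, use_reentrant=True)
out.sum().backward()                        # recomputation runs; layer gets gradients
```

Without `dummy`, the reentrant checkpoint returns an output with `requires_grad=False` and `backward()` fails; whether TE itself additionally requires grad-tracking inputs is exactly the question here.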
It should not be mandatory. Could you share the error you are getting, or a small repro of the problem?
This is the stack trace:
It fails when this line in Megatron-DeepSpeed is changed to `False`: https://github.com/microsoft/Megatron-DeepSpeed/blob/4822c87ee6adfa4e480614cbe3f1d8ae00bd3db7/megatron/model/transformer.py#L1754C1-L1754C107
@timmoon10 Could you take a look at that?