`requires_grad` attribute for cache key
🐛 Bug
The cache key construction logic for Thunder's cache relies on the `requires_grad` attribute of `TensorProxy`, but it shouldn't:
https://github.com/Lightning-AI/lightning-thunder/blob/a8c063676cbecad7c196dcfe4fbeacf85775c9c5/thunder/common.py#L394-L400
Thunder does not propagate `requires_grad` information throughout the trace, so relying on it results in many unnecessary cache misses.
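To illustrate the effect, here is a minimal sketch (not Thunder's actual code; `ProxyMeta` and `make_key` are made up for the example) of how including `requires_grad` in the key splits otherwise-identical inputs into distinct cache entries:

```python
from collections import namedtuple

# Stand-in for the TensorProxy metadata that might feed a cache key.
ProxyMeta = namedtuple("ProxyMeta", ["shape", "dtype", "device", "requires_grad"])

def make_key(proxies):
    # Including requires_grad means two calls that are otherwise identical
    # hash to different keys whenever the flag differs.
    return tuple((p.shape, p.dtype, p.device, p.requires_grad) for p in proxies)

a = ProxyMeta((4, 4), "float32", "cuda", True)
b = ProxyMeta((4, 4), "float32", "cuda", False)

print(make_key([a]) == make_key([b]))  # False -> cache miss despite identical compute
```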
cc @mruberry @lantiga @IvanYashchuk
But how does the lack of propagation actually create indeterminism here?
What do you mean by indeterminism?
I found this problem in a very specific way, but I think it's a good starting point for understanding why I wrote this issue.
I was debugging an executor, specifically the creation of the joint forward and backward trace for two torch.nn.functional.linear calls one after the other. I noticed that even though the symbols were the same, the inputs were the same, and caching was enabled, the executor received two calls for the grad transformation of the symbol.
The reason for this was a cache miss here: https://github.com/Lightning-AI/lightning-thunder/blob/2d18b7a07fb5e83d138707b7bf97bd2f82da4f2c/thunder/core/vjp_utils.py#L58-L61
In my opinion that should not happen, because we do not track the `requires_grad` of a tensor throughout the trace's lifetime.
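Here is a rough sketch of the pattern I hit; the `get_grad_transform` helper and its key layout are hypothetical stand-ins for the lookup in vjp_utils, just to show why two identical linear calls end up requesting the grad transformation twice:

```python
grad_transform_cache = {}
grad_transform_calls = 0

def get_grad_transform(sym_name, input_metas):
    global grad_transform_calls
    # The key includes requires_grad, mirroring the lookup that misses.
    key = (sym_name, tuple((m["shape"], m["dtype"], m["requires_grad"]) for m in input_metas))
    if key not in grad_transform_cache:
        grad_transform_calls += 1  # the executor gets another grad-transform request
        grad_transform_cache[key] = f"grad rule for {sym_name}"
    return grad_transform_cache[key]

# Two linear calls with identical shapes and dtypes; only the (unpropagated)
# requires_grad flag differs, so the second lookup misses.
first = [{"shape": (8, 16), "dtype": "float32", "requires_grad": True}]
second = [{"shape": (8, 16), "dtype": "float32", "requires_grad": False}]
get_grad_transform("linear", first)
get_grad_transform("linear", second)
print(grad_transform_calls)  # 2, even though the compute is the same
```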
Another idea I've been considering is to delegate key creation to the executor. For example, the StatefulExecutor would benefit from custom key creation (say, when doing checkpointing, the proxies have the same name, and that should signal that the bound symbol in the backward is the same as the one encountered in the forward). A sketch of what I mean is below.
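A very rough sketch, assuming a hypothetical `make_cache_key` hook on executors (Thunder doesn't expose this today):

```python
class DefaultExecutor:
    def make_cache_key(self, bsym_name, proxies):
        # Default: structural key based on shape/dtype only.
        return (bsym_name, tuple((p["shape"], p["dtype"]) for p in proxies))

class StatefulExecutor(DefaultExecutor):
    def make_cache_key(self, bsym_name, proxies):
        # For checkpointing, keying on proxy names lets the backward pass
        # recognize a bound symbol already seen in the forward pass.
        return (bsym_name, tuple(p["name"] for p in proxies))

fwd_proxies = [{"name": "t0", "shape": (8, 16), "dtype": "float32"}]
bwd_proxies = [{"name": "t0", "shape": (8, 16), "dtype": "float32"}]

ex = StatefulExecutor()
print(ex.make_cache_key("linear", fwd_proxies) == ex.make_cache_key("linear", bwd_proxies))  # True
```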
So I think we'd be dropping the caching you link to after #1956, hopefully?