
Clarify explanation of requires_grad in PyTorch

Open FlightVin opened this issue 2 weeks ago • 5 comments

Fixes #3716

Description

It was initially challenging for me to grasp why requires_grad is set after the weights initialization, but on the same line as the bias.

The existing explanation ("we don't want that step included in the gradient") is technically correct, but it omits the practical consequence: leaf-node status.

If requires_grad=True is set before the initialization math (the division by sqrt(n)), the division is recorded in the autograd graph and the weights tensor becomes a computed output (a non-leaf node) rather than a source parameter. Its .grad attribute is then not populated by backward(), so an optimizer cannot update it.

This PR clarifies that we set requires_grad after the math so that the tensor remains a trainable leaf node.
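For illustration, here is a minimal sketch of the distinction (the 784×10 shape and sqrt(784) scaling are assumptions, borrowed from the tutorial's MNIST example):

```python
import math
import torch

# Leaf node: requires_grad is enabled in place *after* the init math,
# so the division is not tracked and the tensor stays a source parameter.
weights = torch.randn(784, 10) / math.sqrt(784)
weights.requires_grad_()
print(weights.is_leaf)  # True -- backward() will populate weights.grad

# Non-leaf node: requires_grad=True *before* the division means the
# division is recorded, and the result is an intermediate tensor.
w = torch.randn(784, 10, requires_grad=True) / math.sqrt(784)
print(w.is_leaf)  # False -- w.grad stays None after backward()
```

Because w.grad is never populated, an optimizer stepping over it would have nothing to apply, which is the failure mode the clarified wording is meant to prevent.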

Checklist

  • [x] The issue being fixed is referenced in the description (see above "Fixes #ISSUE_NUMBER")
  • [x] Only one issue is addressed in this pull request
  • [ ] Labels from the issue that this PR is fixing are added to this pull request
  • [x] No unnecessary issues are included in this pull request.

P.S. Leaving the 3rd point unchecked since the issue has no labels yet.

cc @svekars @sekyondaMeta @AlannaBurke @albanD @jbschlosser

FlightVin · Jan 04 '26 10:01