Fix double-tracing in SpecPropPass
Our current SpecPropPass doesn't properly capture the effect of guards in the shape environment due to double-tracing certain ops. The problem looks like this:
- Every time we trace through the graph, we generate new symints.
- That's fine, since shape_env will pick up guards during the retrace.
- The problem is that SpecPropPass does this twice: once to generate the spec, and then once more by calling super().call_operator(...) (https://github.com/.../exir/passes/spec_prop_pass.py...).
- The tensor spec gets the symint from the first trace, but the graph and the guards use the second.
- Hence the tensor spec never picks up the guards.
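The bullet points above can be sketched with a toy model (plain Python, not the real EXIR/tracer code; `SymInt` and `trace` here are illustrative stand-ins):

```python
import itertools

_counter = itertools.count()

class SymInt:
    """Stand-in for a fresh unbacked symint minted during tracing."""
    def __init__(self):
        self.name = f"u{next(_counter)}"
        self.guards = []  # guards end up attached to one specific symbol

def trace():
    # Every trace through the graph mints a brand-new symbol.
    return SymInt()

# First trace: the pass builds the tensor spec from this symbol.
spec_sym = trace()

# Second trace (the super().call_operator(...) call): the graph and the
# shape-environment guards use a *different* symbol.
graph_sym = trace()
graph_sym.guards.append(f"{graph_sym.name} >= 0")

print(spec_sym.name, spec_sym.guards)    # u0 []
print(graph_sym.name, graph_sym.guards)  # u1 ['u1 >= 0']
```

The spec's symbol (`u0`) never sees the guard recorded on `u1`, which is the failure mode described above.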
To resolve this, I've updated the SpecPropPass to re-trace the graph and then generate specs based on the meta values, not the traced ProxyValues (thanks @angelayi for the suggestion). This resolves the issue.
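A minimal sketch of the shape of that fix, using hypothetical `Node`/`TensorSpec`/`retrace` stand-ins rather than the real EXIR API:

```python
class Node:
    """Stand-in for an FX graph node."""
    def __init__(self):
        self.meta = {}

class TensorSpec:
    """Stand-in for EXIR's TensorSpec; holds a (symbolic) shape."""
    def __init__(self, shape):
        self.shape = shape

def retrace(node):
    # The single retrace records its result into node.meta["val"],
    # mirroring how FX stores fake-tensor values on node metadata.
    sym = object()  # stand-in for the freshly minted symint
    node.meta["val"] = (sym,)
    return node

node = retrace(Node())

# Build the spec from the meta value instead of tracing a second time,
# so the spec holds the exact symbol object the guards refer to.
spec = TensorSpec(shape=node.meta["val"])

assert spec.shape[0] is node.meta["val"][0]
```

Because there is only one trace, the spec and the shape environment share the same symbols, so guards are visible from the spec.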
I originally saw this issue with the NMS torchvision op. To avoid adding a new dependency to the core EXIR tests, I've instead written a test with a custom op whose meta kernel produces an unbacked symint in its output shape, which replicates the bug in the same way.
Differential Revision: D85913581
:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/15485
:white_check_mark: You can merge normally! (1 Unrelated Failure)
As of commit 7da6e25b874d99326f95f536af823000d2bf1c39 with merge base 18c1c5b515c49ae00c6946b97cde5daff43dcf1a:
BROKEN TRUNK - The following job failed but was already present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
- pull / android / run-emulator (gh) (trunk failure)
Timeout waiting for emulator to boot.
@GregoryComer has exported this pull request. If you are a Meta employee, you can view the originating Diff in D85913581.
One additional note: the aliasing analysis feels pretty fragile as-is. I fixed several subtle issues (luckily caught by CI) where my changes accidentally planned two separate tensors when they should alias and share one TensorSpec.
I'm wondering if we should rewrite this pass again to either rely on ProxyValue reference equality or otherwise introduce some proper aliasing analysis, as opposed to hard-coding that, for example, getitem and output always alias their argument.
This seems like it could get messy with non-functional custom ops or defunctionalization, in general. @JacobSzwejbka @angelayi what are your thoughts on this?
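As a toy sketch of what identity-based aliasing could look like (illustrative only, not a proposal for the actual implementation): keying specs on value identity makes aliases share one spec without per-op special cases.

```python
class Value:
    """Stand-in for a graph value (e.g. a ProxyValue)."""

specs = {}  # id(value) -> spec; aliases collapse to one entry

def spec_for(value):
    # Two names that refer to the same underlying value get one spec,
    # with no hard-coded knowledge of getitem/output semantics. The
    # value is stored in its spec so it stays alive and its id() is
    # never reused.
    return specs.setdefault(id(value), {"for": value})

x = Value()
y = x          # alias of x
z = Value()    # distinct value

assert spec_for(x) is spec_for(y)
assert spec_for(x) is not spec_for(z)
```

Whether this composes with non-functional custom ops or defunctionalization is exactly the open question raised above.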
@GregoryComer has imported this pull request. If you are a Meta employee, you can view this in D85913581.
Note that the moshi and zephyr size test failures are pre-existing.