Lack of 2.9 wheel causing torchprime test error

Open pgmoka opened this issue 5 months ago • 2 comments

We have not yet built the 2.9 wheel, which is causing torchprime tests to fail (see https://github.com/pytorch/xla/actions/runs/16177668026/job/45667831241 for an example).

I believe https://github.com/pytorch/xla/pull/9461 should trigger the build necessary to resolve the issue.

pgmoka, Jul 09 '25 21:07

I have drilled down further here. The torchprime test errors seem to come from trying to build a version of PyTorch that does not exist, torch-2.9.0-cp312-cp312-linux_x86_64.whl, from its Dockerfile: https://github.com/pytorch/xla/blob/master/infra/ansible/ptxla_docker_for_torchprime.Dockerfile.
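For anyone reading the wheel name above, here is a minimal sketch of how its parts break down under the standard wheel naming convention (PEP 427). The helper names `parse_wheel_filename` and `interpreter_tag` are hypothetical and written just for this illustration; they are not part of the pytorch/xla build scripts, and the interpreter check assumes CPython.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 components.

    Layout: {name}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl
    Hypothetical helper for illustration only.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return {
        "name": name,
        "version": version,
        "build": build,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,
    }


def interpreter_tag():
    """Return the CPython tag of the running interpreter, e.g. 'cp312'.

    Assumes CPython; other implementations use different prefixes.
    """
    return f"cp{sys.version_info.major}{sys.version_info.minor}"


info = parse_wheel_filename("torch-2.9.0-cp312-cp312-linux_x86_64.whl")
print(info["name"], info["version"], info["python_tag"], info["platform_tag"])
print("matches running interpreter:", info["python_tag"] == interpreter_tag())
```

The `cp312` tag means the wheel targets CPython 3.12, so the interpreter version inside the image matters as much as the torch version when a wheel lookup fails.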

From what I can gather, this happens because the Dockerfile tries to reuse the wheels built for the test by the initial build_and_test.yml call. Some edge-case interaction is causing it to request the stable version of torch for the torchprime tests. This stands in contrast to https://github.com/pytorch/xla/blob/93a5e5833ea68507ba9fc63976ac3f775132b155/infra/ansible/Dockerfile#L4, which builds the wheels from the bottom up and does not seem to run into the same problem.

I believe taking a similar approach to https://github.com/pytorch/xla/blob/93a5e5833ea68507ba9fc63976ac3f775132b155/infra/ansible/Dockerfile#L4 and building everything from scratch should resolve the issue, though it would hurt performance. Given that I am being pulled onto other issues, that might be a reasonable tradeoff to get tests rolling again.

The rest of the "build and test" tests do not seem to do that. I will take a look at those other tests to see if there are more clues there.

pgmoka, Jul 15 '25 17:07

My earlier comment (https://github.com/pytorch/xla/issues/9466#issuecomment-3074640270) was going down the wrong rabbit hole. The issue actually seems to be tied to updating to Python 3.12. The latest commit on https://github.com/pytorch/xla/pull/9481 (commit link) tested that theory, and we were able to move forward with the build. We are currently blocked on updating the launcher for torchprime (https://github.com/AI-Hypercomputer/torchprime/pull/342).

pgmoka, Jul 16 '25 16:07