[MPS] Fix overflow in cumsum when dtype is bool
cumsum and cumprod are (or at least were) buggy on MPS: https://github.com/pytorch/pytorch/blob/c8d2a55273757c90989fde7c6f05e957aba9a238/aten/src/ATen/native/mps/operations/UnaryOps.mm#L435-L436
The existing workaround casts the input to int32 before performing the op to prevent overflow for certain numeric types.
It turns out this issue also affects boolean types:
```python
import torch

print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')
```
In this PR I'm adding logic to also cast bool dtypes to int32 prior to cumsum and cumprod, even though for cumprod the output is guaranteed not to overflow with bools. I'm also adding a test to prevent regressions.
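The wraparound can be reproduced without an MPS device. The following pure-Python sketch (an illustration only, not the actual Metal kernel) mimics a cumulative sum whose accumulator is effectively 8 bits wide, which is what the bool input degenerates to without the cast:

```python
# Pure-Python illustration (not the actual Metal kernel) of why widening
# matters: accumulate 128 True values in a simulated signed N-bit register.
def cumsum_simulated(xs, bits):
    mask, sign = (1 << bits) - 1, 1 << (bits - 1)
    acc, out = 0, []
    for x in xs:
        acc = (acc + int(x)) & mask          # wrap at N bits
        out.append(acc - (mask + 1) if acc & sign else acc)
    return out

bools = [True] * 128
print(cumsum_simulated(bools, bits=8)[-1])   # -128: the observed overflow
print(cumsum_simulated(bools, bits=32)[-1])  # 128: widening to int32 avoids it
```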
Fixes #96614
/cc @kulinseth @malfet
@hvaara, on which macOS version did you observe this issue?
@kulinseth macOS 14.4.1.
Looks like the ciflow/mps label wasn't added automatically. I'm not sure why this didn't happen, as the modified source paths match the autolabel regex:
https://github.com/pytorch/test-infra/blob/45e6a91c86e29409ec74a2835b43d08709a8d0d2/torchci/lib/bot/autoLabelBot.ts#L141
I'll try to add it manually, but I might not be allowed since I'm not part of the pytorch org.
@pytorchbot label "ciflow/mps"
> Looks like the ciflow/mps label wasn't added automatically.
It wasn't added automatically because you haven't contributed in the past. Alas, it's a necessary security measure to prevent resource misuse, but once the workflows are approved by a member of the org, anyone should be able to add labels.
> It wasn't added automatically as you have not contributed in the past.
Got it, that makes sense. Thanks for clarifying! 😄
Looks like TestConsistencyCPU.test_output_grad_match_cumprod_cpu_float32 has now switched to Unexpected success (ref MACOS_BEFORE_13_3_XFAILLIST_GRAD), which could actually be good news. https://github.com/pytorch/pytorch/issues/106112 seems related. I was surprised by this and will investigate.
https://github.com/pytorch/pytorch/blob/8046de351285c4eae765e1b16da0e26e500a6d58/test/test_mps.py#L160-L163
contains a couple more dtypes that might be affected. It would be interesting to get the CI to test these dtypes as well, since I don't have access to macOS < 13.3. I remember reading somewhere that you can tell the CI to continue on failure instead of stopping early, but I don't remember where I read it or how to do it.
Google didn't help, but grepping the source code did 🤣
@pytorchbot label "keep-going"
https://github.com/pytorch/pytorch/blob/8046de351285c4eae765e1b16da0e26e500a6d58/scripts/compile_tests/update_failures.py#L28-L31
I'll force-push to get the tests running again.
From the CI logs it seems
- TestConsistencyCPU.test_output_grad_match_cumprod_cpu_float32
- TestConsistencyCPU.test_output_grad_match_cumprod_cpu_float16
- TestConsistencyCPU.test_output_grad_match_masked_cumprod_cpu_float16
are all failing with Unexpected success, so it seems this is also related to https://github.com/pytorch/pytorch/issues/109166.
I'll remove the entries from MACOS_BEFORE_13_3_XFAILLIST_GRAD and rerun the unit tests. If the tests pass I guess this PR also fixes the issues seen in #106112 and #109166?
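For context, the xfail lists in test_mps.py gate expected failures by macOS version, so a stale entry surfaces as an unexpected success once the underlying bug is fixed. A simplified sketch of that mechanism (the list name mirrors test_mps.py, but the exact shape and dtype spellings here are illustrative):

```python
# Simplified, illustrative sketch of a version-gated expected-failure list.
# Ops listed here are expected to fail on macOS < 13.3; if the bug is fixed
# but the entry stays, the test is reported as an "unexpected success".
MACOS_BEFORE_13_3_XFAILLIST_GRAD = {
    "cumprod": ["float16", "float32"],
    "masked.cumprod": ["float16"],
}

def expected_to_fail(op, dtype, macos_before_13_3):
    """Return True if (op, dtype) should be marked xfail on this OS."""
    if not macos_before_13_3:
        return False
    return dtype in MACOS_BEFORE_13_3_XFAILLIST_GRAD.get(op, [])

# After removing the cumprod entries, these gradients are expected to pass.
print(expected_to_fail("cumprod", "float32", macos_before_13_3=True))
```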
Alright, looks like all the tests passed. I'll add #106112 and #109166 to the fixes keyword in the first comment. Then this should be good to go, so I'll attempt to ship it.
@pytorchbot merge
Pull workflow has not been scheduled for the PR yet. It could be because the author doesn't have permissions to run those, or skip-checks keywords were added to the PR/commits; aborting merge. Please get/give approval for the workflows and/or remove skip-ci decorators before the next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.
@kulinseth @malfet Can you please help me with next steps? I don't know what I'm missing/doing wrong to get this merged.
@pytorchbot merge -f "Lint and MPS tests are green"
Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.
Learn more about merging in the wiki.
Questions? Feedback? Please reach out to the PyTorch DevX Team.
Advanced Debugging: Check the merge workflow status here.
Thanks for the reviews and help! 😄