
[MPS] Fix overflow in cumsum when dtype is bool

Open hvaara opened this issue 1 year ago • 10 comments

cumsum and cumprod were (and may still be?) buggy on MPS: https://github.com/pytorch/pytorch/blob/c8d2a55273757c90989fde7c6f05e957aba9a238/aten/src/ATen/native/mps/operations/UnaryOps.mm#L435-L436

A workaround casts the input to int32 prior to performing the op to prevent overflow for certain numeric types.

It turns out this issue also affects boolean types:

import torch
print(torch.ones(128, dtype=torch.bool, device="mps").cumsum(0)[-1])
# tensor(-128, device='mps:0')

In this PR I'm adding logic to also cast bool dtypes to int32 prior to cumsum and cumprod, although the output is guaranteed not to overflow for the latter with bools. I'm also adding a test to prevent regressions.
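The wraparound can be reproduced without MPS at all. The sketch below simulates accumulation in a signed 8-bit type in plain Python (the `wrap_int8` helper is purely illustrative, not part of the PR); widening the accumulator, as the cast-to-int32 workaround does, avoids the wrap.

```python
def wrap_int8(x):
    # Two's-complement wraparound into the signed 8-bit range [-128, 127],
    # illustrating the effect of accumulating in a too-narrow type.
    return ((x + 128) % 256) - 128

# Summing 128 ones: the true sum 128 wraps to -128 in int8,
# matching the tensor(-128) seen in the repro above.
narrow = 0
for _ in range(128):
    narrow = wrap_int8(narrow + 1)

# A wide (int32-like) accumulator has headroom, so no wrap occurs.
wide = sum(1 for _ in range(128))

print(narrow, wide)  # -> -128 128
```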

Fixes #96614

/cc @kulinseth @malfet

hvaara avatar May 01 '24 18:05 hvaara

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/125318

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit efe7f5381b4a54e0aa0fc20cd66e1674706423a5 with merge base 8046de351285c4eae765e1b16da0e26e500a6d58 (image): :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar May 01 '24 18:05 pytorch-bot[bot]

CLA Signed

The committers listed above are authorized under a signed CLA.

  • :white_check_mark: login: hvaara / name: Roy Hvaara (229b9e93e795c5a2b76d45259a0f8cdfbeefe16c, efe7f5381b4a54e0aa0fc20cd66e1674706423a5)

@hvaara, which macOS version did you observe this issue on?

kulinseth avatar May 01 '24 19:05 kulinseth

@kulinseth macOS 14.4.1.

hvaara avatar May 01 '24 19:05 hvaara

Looks like the ciflow/mps label wasn't added automatically. I'm not sure why this didn't happen, as the modified source paths match the autolabel regex:

https://github.com/pytorch/test-infra/blob/45e6a91c86e29409ec74a2835b43d08709a8d0d2/torchci/lib/bot/autoLabelBot.ts#L141

I'll try to add it manually, but I might not be allowed since I'm not part of the pytorch org.

@pytorchbot label "ciflow/mps"
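Conceptually, the autolabel rule just matches the PR's changed file paths against a pattern and applies the label on a hit. A minimal sketch (the pattern here is an assumption; the real rule lives in autoLabelBot.ts and may differ):

```python
import re

# Hypothetical path pattern for the ciflow/mps autolabel rule;
# illustrative only, not the actual regex from test-infra.
MPS_PATHS = re.compile(r"aten/src/ATen/native/mps/|test/test_mps\.py")

changed_files = ["aten/src/ATen/native/mps/operations/UnaryOps.mm"]
if any(MPS_PATHS.search(p) for p in changed_files):
    print("would add label: ciflow/mps")
```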

hvaara avatar May 02 '24 13:05 hvaara

Looks like the ciflow/mps label wasn't added automatically.

It wasn't added automatically because you haven't contributed in the past. Alas, it's a necessary security measure to prevent resource misuse, but once the workflows have been approved by a member of the org, anyone should be able to add labels.

malfet avatar May 02 '24 13:05 malfet

It wasn't added automatically as you have not contributed in the past.

Got it, that makes sense. Thanks for clarifying! 😄

hvaara avatar May 02 '24 14:05 hvaara

Looks like TestConsistencyCPU.test_output_grad_match_cumprod_cpu_float32 has now switched to Unexpected success (ref MACOS_BEFORE_13_3_XFAILLIST_GRAD), which could actually be good news. https://github.com/pytorch/pytorch/issues/106112 seems related. I was surprised by this and will investigate.

https://github.com/pytorch/pytorch/blob/8046de351285c4eae765e1b16da0e26e500a6d58/test/test_mps.py#L160-L163

contains a couple more dtypes that might be affected. It would be interesting to get CI to test these dtypes as well, since I don't have access to macOS < 13.3. I remember reading somewhere that you can tell CI to continue on failure instead of stopping early, but I don't remember where I read it or how to do it.
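For readers unfamiliar with xfail lists: an "Unexpected success" means a test listed as expected-to-fail actually passed, which usually signals the underlying bug was fixed. A minimal sketch of the bookkeeping (names and structure are illustrative, not the actual test_mps.py implementation):

```python
# Hypothetical expected-failure list, loosely modeled on
# MACOS_BEFORE_13_3_XFAILLIST_GRAD from test_mps.py.
XFAILLIST_GRAD = {"cumprod": ["torch.float32", "torch.float16"]}

def classify(op, dtype, passed):
    # An entry in the xfail list means the test is expected to fail;
    # a pass then gets reported as "unexpected success".
    expected_to_fail = dtype in XFAILLIST_GRAD.get(op, [])
    if passed and expected_to_fail:
        return "unexpected success"
    if not passed and expected_to_fail:
        return "expected failure"
    return "pass" if passed else "fail"

print(classify("cumprod", "torch.float32", passed=True))
# -> unexpected success
```

Removing the now-passing entries from the list, as done below, turns these results back into plain passes.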

hvaara avatar May 02 '24 14:05 hvaara

Google didn't help, but grepping the source code did 🤣

@pytorchbot label "keep-going"

hvaara avatar May 02 '24 14:05 hvaara

https://github.com/pytorch/pytorch/blob/8046de351285c4eae765e1b16da0e26e500a6d58/scripts/compile_tests/update_failures.py#L28-L31

I'll force-push to get the tests running again.

hvaara avatar May 02 '24 14:05 hvaara

From the CI logs it seems

  • TestConsistencyCPU.test_output_grad_match_cumprod_cpu_float32
  • TestConsistencyCPU.test_output_grad_match_cumprod_cpu_float16
  • TestConsistencyCPU.test_output_grad_match_masked_cumprod_cpu_float16

are all failing with Unexpected success. So it seems it's also related to https://github.com/pytorch/pytorch/issues/109166.

I'll remove the entries from MACOS_BEFORE_13_3_XFAILLIST_GRAD and rerun the unit tests. If the tests pass I guess this PR also fixes the issues seen in #106112 and #109166?

hvaara avatar May 02 '24 18:05 hvaara

Alright, looks like all the tests passed. I'll add #106112 and #109166 to the fixes keyword in the first comment. Then this should be good to go, so I'll attempt to ship it.

hvaara avatar May 02 '24 19:05 hvaara

@pytorchbot merge

hvaara avatar May 02 '24 19:05 hvaara

Pull workflow has not been scheduled for the PR yet. It could be because author doesn't have permissions to run those or skip-checks keywords were added to PR/commits, aborting merge. Please get/give approval for the workflows and/or remove skip ci decorators before next merge attempt. If you think this is a mistake, please contact PyTorch Dev Infra.

pytorch-bot[bot] avatar May 02 '24 19:05 pytorch-bot[bot]

@kulinseth @malfet Can you please help me with next steps? I don't know what I'm missing/doing wrong to get this merged.

hvaara avatar May 02 '24 19:05 hvaara

@pytorchbot merge -f "Lint and MPS tests are green"

malfet avatar May 03 '24 01:05 malfet

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging Check the merge workflow status here

pytorchmergebot avatar May 03 '24 01:05 pytorchmergebot

Thanks for the reviews and help! 😄

hvaara avatar May 03 '24 01:05 hvaara