cuda-python icon indicating copy to clipboard operation
cuda-python copied to clipboard

cuda.core.system: Better checks for when we expect APIs to be unsupported

Open mdboom opened this issue 1 month ago • 7 comments

Prior to this PR, unit tests in cuda.core.system would just always skip the test if an API returned an NotSupportedError. This meant that a test could be skipped even when the documentation suggests that it should in fact work. This PR makes it so such a skip would only be acceptable if the API is documented as not being supported for a given architecture.

There is some nuance, though. Some of these APIs are still failing as not supported even when the documentation says they should be. This may be because some other aspect of the device (not just the architecture) makes it unsupported. Unfortunately, the NVML headers don't offer any clues there, but at least now we know where all of these cases are (because they are noted with unsupported_before(device, None), and we can follow up by filing documentation bugs with NVML and refining the criteria by which tests are skipped even further.

This also removes the behavior of some tests where it would collect different skip reasons for each device and report all of them. This added a lot of complexity for little benefit -- it is probably very rare that a system would have two devices of different architectures, especially on systems that we would use for testing -- so I just took all of that out to simplify things.

mdboom avatar Jan 16 '26 18:01 mdboom

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Jan 16 '26 18:01 copy-pr-bot[bot]

/ok to test

mdboom avatar Jan 16 '26 18:01 mdboom

Doc Preview CI :---: |

:rocket: View preview at
https://nvidia.github.io/cuda-python/pr-preview/pr-1510/
|
https://nvidia.github.io/cuda-python/pr-preview/pr-1510/cuda-core/
|
https://nvidia.github.io/cuda-python/pr-preview/pr-1510/cuda-bindings/
|
https://nvidia.github.io/cuda-python/pr-preview/pr-1510/cuda-pathfinder/

|

Preview will be ready when the GitHub Pages deployment is complete.

github-actions[bot] avatar Jan 16 '26 18:01 github-actions[bot]

/ok to test

mdboom avatar Jan 16 '26 19:01 mdboom

/ok to test

mdboom avatar Jan 16 '26 19:01 mdboom

/ok to test

mdboom avatar Jan 16 '26 19:01 mdboom

/ok to test

mdboom avatar Jan 16 '26 20:01 mdboom

/ok to test

mdboom avatar Jan 20 '26 13:01 mdboom

/ok to test

mdboom avatar Jan 20 '26 14:01 mdboom

/ok to test

mdboom avatar Jan 20 '26 15:01 mdboom

Just logging a concern: While this sounds reasonable today, it could be very surprising in the future, because a skip for one device may mask issues with another, and may silently reduce test coverage in a way that's easy to miss. Skips generally look harmless, while these skips are maybe not. Ideally the potentially failure-masking skips should come with a warning; in this case, only if there are more devices left to loop over, which isn't so easy to get right I guess.

Thinking about this more, I think we can write a test for this case, so at least it might offer a clue to our future selves. I'll add it here.

mdboom avatar Jan 22 '26 12:01 mdboom

/ok to test

mdboom avatar Jan 22 '26 12:01 mdboom

/ok to test

mdboom avatar Jan 22 '26 13:01 mdboom

Doc Preview CI :---: Preview removed because the pull request was closed or merged.

github-actions[bot] avatar Jan 22 '26 15:01 github-actions[bot]