CV-CUDA Kernels Inherit torch.cuda.current_stream()
Summary
CV-CUDA has its own default stream, which it uses to execute its kernels. This behavior is fine when using the Compose transform API with explicit conversion at the start and end, or when using F.cvcuda_to_tensor and F.tensor_to_cvcuda, because CV-CUDA synchronizes its own stream when sharing data with external memory. However, there are a few edge cases which I believe give us motivation to have CV-CUDA share the current PyTorch CUDA stream:
- When using torch.cuda.synchronize() with cvcuda.Tensor in the functional API, there is no work for PyTorch to synchronize on, since the work has been queued on a different stream.
- If a user specifies a particular CUDA stream via with torch.cuda.stream(...) or a similar call, the CV-CUDA work still gets scheduled on a separate stream (see the sketch after this list). In certain scenarios this could result in degraded performance via context switching, and in general it is non-intuitive behavior.
- Should a user want to synchronize while using the functional API with the CV-CUDA backend, they would have to call cvcuda.Stream.current.sync(), which introduces unneeded complexity and library-mixing in user code.
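To make the second point concrete, here is a minimal hypothetical sketch of the mismatch. The dispatch of F.vertical_flip to the CV-CUDA kernel is assumed, per the registration shown under Implementation, and `images` is an illustrative placeholder input.

```python
import torch
from torchvision.transforms.v2 import functional as F

# illustrative CUDA uint8 image tensor
images = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8, device="cuda")

user_stream = torch.cuda.Stream()
with torch.cuda.stream(user_stream):
    cv_images = F.tensor_to_cvcuda(images)  # convert to a cvcuda.Tensor
    # without this change, the flip is enqueued on CV-CUDA's own current stream
    # rather than user_stream, even though the user explicitly asked for user_stream
    flipped = F.vertical_flip(cv_images)
```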
I propose we implement a decorator/wrapper function which handles assignment of the current CV-CUDA stream based on the current torch.cuda stream. This makes CV-CUDA kernels in TorchVision behave much more like their variants that operate on PyTorch tensors.
Implementation
import functools
from typing import Callable, ParamSpec, TypeVar

import torch

P = ParamSpec("P")
R = TypeVar("R")


def _cvcuda_shared_stream(fn: Callable[P, R]) -> Callable[P, R]:
    # import cvcuda once, at function wrapping time
    cvcuda = _import_cvcuda()

    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # get the current torch CUDA stream at function call time
        stream = torch.cuda.current_stream()
        # cvcuda.Stream supports context managers to assign the thread-local current stream
        with cvcuda.as_stream(stream):
            # calls the cvcuda operator, which uses the current stream by default;
            # inside this cvcuda.Stream context manager, that is the shared stream
            result = fn(*args, **kwargs)
        return result

    return wrapper
Example of wrapping the existing vertical_flip kernel for CV-CUDA:
def _vertical_flip_image_cvcuda(image: "cvcuda.Tensor") -> "cvcuda.Tensor":
    return _import_cvcuda().flip(image, flipCode=0)


if CVCUDA_AVAILABLE:
    _register_kernel_internal(vertical_flip, _import_cvcuda().Tensor)(
        _cvcuda_shared_stream(_vertical_flip_image_cvcuda)
    )
Testing
As of right now, there is no testing strategy in place for this change. The naive approach would be to assert, via the higher-level functional API and torch.cuda.synchronize(), that the CV-CUDA kernels do not block without this behavior and do block with it. An alternative could potentially use torch.cuda.Event; a rough sketch of that idea follows.
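For illustration only, one shape the torch.cuda.Event idea could take (a sketch, not an actual test from this PR; wrapped_flip and cv_image are hypothetical placeholders for a kernel wrapped with _cvcuda_shared_stream and a cvcuda.Tensor input):

```python
import torch

# `wrapped_flip` and `cv_image` are placeholders: a kernel wrapped with
# _cvcuda_shared_stream and a cvcuda.Tensor already on the GPU
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    out = wrapped_flip(cv_image)
    event = torch.cuda.Event()
    event.record(stream)

# If the CV-CUDA kernel was enqueued on `stream`, this wait only returns after the
# kernel has finished. If CV-CUDA used its own stream instead, the event can
# complete while the kernel is still in flight.
event.synchronize()
```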
Feedback
I would love to get feedback on whether this change should be pursued and, if this is behavior the team wants in TorchVision, on the testing strategy.
Thanks for the PR @justincdavis and for bringing this up. I'll have to think more, but this seems reasonable from a quick look.
Re testing, does CVCUDA expose an API to get the current stream it's working on, something like https://docs.pytorch.org/docs/stable/generated/torch.cuda.current_stream.html? If it does, maybe a small test like this one would be enough:
new_stream = torch.cuda.Stream()

def assert_cvcuda_is_using_torch_stream():
    # cvcuda.current_stream() stands in for whatever API CVCUDA exposes to
    # query the stream it is currently working on
    assert cvcuda.current_stream() == new_stream

with torch.cuda.stream(new_stream):
    _cvcuda_shared_stream(assert_cvcuda_is_using_torch_stream)()
Hi @NicolasHug, CVCUDA does expose this! I added a simple positive/negative test that checks the handles of the two streams; a rough sketch of its shape is below.
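For reference, something along these lines. Attribute names on the CV-CUDA side (cvcuda.Stream.current.handle) are assumptions here, torch.cuda.Stream.cuda_stream is the raw CUDA stream handle on the PyTorch side, and this is not the exact test added in the PR.

```python
import torch

def test_cvcuda_shared_stream_follows_torch_stream():
    cvcuda = _import_cvcuda()
    stream = torch.cuda.Stream()

    def check_shared():
        # positive case: inside the wrapper both libraries should point at the same
        # underlying CUDA stream handle (`.handle` on the CV-CUDA side is assumed)
        assert cvcuda.Stream.current.handle == torch.cuda.current_stream().cuda_stream

    with torch.cuda.stream(stream):
        _cvcuda_shared_stream(check_shared)()
        # negative case: outside the wrapper CV-CUDA stays on its own default stream
        assert cvcuda.Stream.current.handle != torch.cuda.current_stream().cuda_stream
```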