CV-CUDA Kernels Inherit torch.cuda.current_stream()
Summary
CV-CUDA has its own default stream, which it uses to execute its kernels. This behavior is fine when using the Compose transform API with explicit conversion at the start and end, or when using F.cvcuda_to_tensor and F.tensor_to_cvcuda, because CV-CUDA synchronizes its own stream when sharing data with external memory. However, there are a few edge cases which I believe give us motivation to have CV-CUDA share the current PyTorch CUDA stream:
- When using torch.cuda.synchronize() with cvcuda.Tensor in the functional API, there is no work for PyTorch to synchronize on, since the work has been queued on a different stream.
- If a user specifies a particular CUDA stream via with torch.cuda.stream(...) or a similar call, the CV-CUDA work still gets scheduled on a separate stream (see the sketch after this list). In certain scenarios this could result in degraded performance via context switching, and in general it is non-intuitive behavior.
- Should a user want to synchronize while using the functional API with the CV-CUDA backend, they would have to call cvcuda.Stream.current.sync(), which introduces unneeded complexity and library-mixing in user code.
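To make the second point concrete, here is a minimal hypothetical sketch of the mismatch. The dispatch of F.vertical_flip to the CV-CUDA kernel is assumed, per the registration shown under Implementation, and `images` is an illustrative placeholder input.

```python
import torch
from torchvision.transforms.v2 import functional as F

# illustrative CUDA uint8 image tensor
images = torch.randint(0, 256, (3, 224, 224), dtype=torch.uint8, device="cuda")

user_stream = torch.cuda.Stream()
with torch.cuda.stream(user_stream):
    cv_images = F.tensor_to_cvcuda(images)  # convert to a cvcuda.Tensor
    # without this change, the flip is enqueued on CV-CUDA's own current stream
    # rather than user_stream, even though the user explicitly asked for user_stream
    flipped = F.vertical_flip(cv_images)
```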
I propose we implement a decorator/wrapper function which handles assignment of the current CV-CUDA stream based on the current torch.cuda stream. This makes CV-CUDA kernels in TorchVision behave much more like their variants that operate on PyTorch tensors.
Implementation
import functools
from typing import Callable, ParamSpec, TypeVar

import torch

P = ParamSpec("P")
R = TypeVar("R")


def _cvcuda_shared_stream(fn: Callable[P, R]) -> Callable[P, R]:
    # import cvcuda once, at function wrapping time
    cvcuda = _import_cvcuda()

    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        # get the current torch CUDA stream at function call time
        stream = torch.cuda.current_stream()
        # cvcuda.Stream supports context managers to assign the thread-local current stream
        with cvcuda.as_stream(stream):
            # calls the cvcuda operator, which uses the current stream by default;
            # inside this cvcuda.Stream context manager, that is the shared stream
            result = fn(*args, **kwargs)
        return result

    return wrapper
Example of wrapping the existing vertical_flip kernel for CV-CUDA:
def _vertical_flip_image_cvcuda(image: "cvcuda.Tensor") -> "cvcuda.Tensor":
    return _import_cvcuda().flip(image, flipCode=0)


if CVCUDA_AVAILABLE:
    _register_kernel_internal(vertical_flip, _import_cvcuda().Tensor)(
        _cvcuda_shared_stream(_vertical_flip_image_cvcuda)
    )
Testing
As of right now, there is no testing strategy in place for this change. The naive approach would be to assert, via the higher-level functional API and torch.cuda.synchronize(), that the CV-CUDA kernels do not block without this behavior and do block with it. An alternative could potentially use torch.cuda.Event; a rough sketch of that idea follows.
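For illustration only, one shape the torch.cuda.Event idea could take (a sketch, not an actual test from this PR; wrapped_flip and cv_image are hypothetical placeholders for a kernel wrapped with _cvcuda_shared_stream and a cvcuda.Tensor input):

```python
import torch

# `wrapped_flip` and `cv_image` are placeholders: a kernel wrapped with
# _cvcuda_shared_stream and a cvcuda.Tensor already on the GPU
stream = torch.cuda.Stream()
with torch.cuda.stream(stream):
    out = wrapped_flip(cv_image)
    event = torch.cuda.Event()
    event.record(stream)

# If the CV-CUDA kernel was enqueued on `stream`, this wait only returns after the
# kernel has finished. If CV-CUDA used its own stream instead, the event can
# complete while the kernel is still in flight.
event.synchronize()
```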
Feedback
I would love to get feedback on whether this change should be pursued and, if this is behavior the team wants in TorchVision, on the testing strategy.
Thanks for the PR @justincdavis and for bringing this up. I'll have to think more, but this seems reasonable from a quick look.
Re testing, does CVCUDA expose an API to get the current stream it's working on, something like https://docs.pytorch.org/docs/stable/generated/torch.cuda.current_stream.html? If it does, maybe a small test like this one would be enough:
new_stream = torch.cuda.Stream()

def assert_cvcuda_is_using_torch_stream():
    # cvcuda.current_stream() stands in for whatever API CVCUDA exposes to
    # query the stream it is currently working on
    assert cvcuda.current_stream() == new_stream

with torch.cuda.stream(new_stream):
    _cvcuda_shared_stream(assert_cvcuda_is_using_torch_stream)()
Hi @NicolasHug, CVCUDA does expose this! I added a simple positive/negative test that checks the handles of the two streams; a rough sketch of its shape is below.
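For reference, something along these lines. Attribute names on the CV-CUDA side (cvcuda.Stream.current.handle) are assumptions here, torch.cuda.Stream.cuda_stream is the raw CUDA stream handle on the PyTorch side, and this is not the exact test added in the PR.

```python
import torch

def test_cvcuda_shared_stream_follows_torch_stream():
    cvcuda = _import_cvcuda()
    stream = torch.cuda.Stream()

    def check_shared():
        # positive case: inside the wrapper both libraries should point at the same
        # underlying CUDA stream handle (`.handle` on the CV-CUDA side is assumed)
        assert cvcuda.Stream.current.handle == torch.cuda.current_stream().cuda_stream

    with torch.cuda.stream(stream):
        _cvcuda_shared_stream(check_shared)()
        # negative case: outside the wrapper CV-CUDA stays on its own default stream
        assert cvcuda.Stream.current.handle != torch.cuda.current_stream().cuda_stream
```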