Use nosync execution policy everywhere
Description
This is a follow-up to #11577 and #12086. We discussed `exec_policy_nosync` and would like to experiment with enabling it everywhere in libcudf.
Since we last investigated this, we have refined much of the library to be more stream-friendly. We also now have weekly compute-sanitizer runs to help us identify any issues. We would like to see whether this change provides any performance improvements or reduces synchronization in the multi-threaded, multi-stream workflow engines that use libcudf. A minimal sketch of the swap under discussion is shown below.
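The call site below is illustrative, not actual libcudf code; it only shows that the change is typically a one-identifier swap at each Thrust call site. `rmm::exec_policy_nosync` passes Thrust's `par_nosync` policy, which lets Thrust skip non-essential internal stream synchronization and return before the algorithm completes, leaving ordering on `stream` to the caller.

```cpp
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/exec_policy.hpp>

#include <thrust/functional.h>
#include <thrust/transform.h>

// Illustrative call site (not actual libcudf code).
void negate_all(rmm::device_uvector<int>& v, rmm::cuda_stream_view stream)
{
  // rmm::exec_policy may block until the algorithm completes;
  // rmm::exec_policy_nosync launches the work on `stream` and may
  // return before it finishes.
  thrust::transform(rmm::exec_policy_nosync(stream),
                    v.begin(), v.end(), v.begin(),
                    thrust::negate<int>{});
}
```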
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
AFAIK compute-sanitizer won't help us find issues where a host object goes out of scope because we removed a sync point. I'm not sure we should replace all of them without examining each instance for this kind of issue.
It may help to investigate these errors with stream-ordered race detection (https://docs.nvidia.com/compute-sanitizer/ComputeSanitizer/index.html#stream-ordered-race-detection): `compute-sanitizer --tool memcheck --track-stream-ordered-races all <application>`
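For example, a run over a single test binary could look like the following (the gtest name is hypothetical):

```
compute-sanitizer --tool memcheck --track-stream-ordered-races all ./gtests/COPYING_TEST
```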
Thought about this a bit more. The scenario I was concerned with is when we use a `std::vector` and copy it to the device (`cudaMemcpyAsync`) at the end of its scope; see the sketch below. In theory the copy could execute after the vector has already been deallocated. However, H2D copies from pageable memory are synchronous with respect to the host, so this is not an issue in practice, unless I'm missing something.
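A minimal sketch of that scenario (the function and names are hypothetical, not libcudf code):

```cpp
#include <cuda_runtime.h>

#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <cstdint>
#include <vector>

// Hypothetical example of the lifetime concern discussed above.
void upload_offsets(rmm::device_uvector<int32_t>& d_offsets,
                    rmm::cuda_stream_view stream)
{
  std::vector<int32_t> h_offsets(d_offsets.size(), 0);  // host staging buffer
  cudaMemcpyAsync(d_offsets.data(),
                  h_offsets.data(),
                  h_offsets.size() * sizeof(int32_t),
                  cudaMemcpyHostToDevice,
                  stream.value());
  // No stream.synchronize() before h_offsets goes out of scope. If this
  // copy were truly asynchronous it would be a use-after-free, but for a
  // pageable source the runtime stages the data before returning, so the
  // host buffer is safe to destroy here.
}
```

Note that if the source were pinned host memory, the copy would be truly asynchronous and destroying the buffer before synchronizing would be a real race.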
Still, IMO we might remove some needed sync points if we don't review this PR carefully.
I'm putting this on pause until I finish #20800 and some other projects first.