cudf Experimental support for configurable prefetching

Description

This PR adds experimental support for prefetching managed memory at a select few points in libcudf. A new configuration object is introduced for handling whether prefetching is enabled or disabled, and whether to print debug information about pointers being prefetched. Prefetching control is managed on a per API basis to enable profiling of the effects of prefetching different classes of data in different contexts. Prefetching in this PR always occurs on the default stream, so it will trigger synchronization with any blocking streams that the user has created. Turning on prefetching and then passing non-blocking to any libcudf APIs will trigger undefined behavior.

Checklist

[x] I am familiar with the Contributing Guidelines.
[x] New or existing tests cover these changes.
[x] The documentation is up to date with these changes.

Jun 13 '24 18:06 vyasr

Reviewing this PR will be a bit unorthodox because of the experimental nature of this work, so we need to consider the desired outcomes on a few different timescales, the shortest of which is the merge of this PR. With that in mind, I'd consider the following sets of requirements.

To merge this PR:

The prefetching config approach needs to be acceptable to reviewers, but does not need to be made any more general. Right now it is highly targeted at allowing granular control over prefetching, which is primarily for testing and profiling purposes. We will have to rework it eventually no matter what the final state of prefetching ends up being.
The approach to prefetching in the cuco allocator shouldn't be an issue. There may be a slightly better option in the way the allocator is structured (as mentioned in the code) but we can come back to that if we deem it worthwhile.
The lack of stream-ordering is also acceptable. We only anticipate using this in cuDF Python, which at the moment has no support for stream-ordered execution anyway. Spark will not be using this in the immediate term.
We don't want to use rmm's functionality yet because the current approach allows the deeper configuration and debugging that we need at the initial stages.

Before 24.08 (If we move forward with enabling prefetching and managed memory)

We will want to remove the very granular controls. Prefetching should become an all or nothing option.
The config should probably be modified to support configuring things other than just prefetching, but support for controlling individual API prefetching should be removed.
We still shouldn't need to enable stream-ordered execution. The reason that I am avoiding that is because it would require far more wide-reaching changes than the current approach since it would mean changing the call sites of every data accessor to handle prefetching (either by changing the signature of methods like column_view_base::data and calling it appropriately, or by adding an additional prefetch method).
We should change to using rmm's prefetching functionality if it helps reduce code duplication.

Long term (again, if prefetching sticks):

We need a plan for enabling stream-ordered execution.
We need to decide if we're willing to accept configuration as a long-term feature of libcudf. I think we're going to have to if we don't want to accept the overhead of prefetching in contexts where no memory is managed, but that is up for debate.
We should make sure to do a more rigorous review of places where memory is not being prefetched. The debugging tooling in the current approach should help with that.

Jun 14 '24 16:06 vyasr

Would you please rebase to pickup https://github.com/rapidsai/cudf/pull/16024?

Jun 26 '24 05:06 GregoryKimball

Yup I have those changes locally, let me push.

Jun 26 '24 16:06 vyasr

/merge

Jul 19 '24 21:07 vyasr