cccl Extract environment boilerplate code from within the device interfaces to a separate header

fixes #5606

The boilerplate code for extracting type information (stream, mr, tuning_t, etc.) is too large and is repeated across the new environment-based device interfaces we introduced. This PR extracts that code into a separate function and reuses it in the existing environment-based device APIs (DeviceScan and DeviceReduce).

Some design considerations for reviewers:

Each device primitive has its own quirks regarding which determinism_t is supported. For example, DeviceReduce::Reduce can support both gpu_to_gpu and run_to_run determinism, while DeviceReduce::ArgMax/ArgMin and DeviceScan only support run_to_run at the moment. That means the determinism heuristics cannot be incorporated into the shared boilerplate code. Future environment-based APIs must evaluate each algorithm individually to determine which determinism types it supports.
The boilerplate code takes a lambda callable that wraps the algorithm-specific implementation for the chosen determinism; the algorithm's arguments are passed through as a parameter pack:

    auto reduce_callable = [&](auto tuning, void* storage, size_t& bytes, auto... args) {
      using tuning_t = decltype(tuning);
      return reduce_impl<tuning_t>(storage, bytes, args...);
    };

    // Dispatch with environment - handles all boilerplate
    return detail::dispatch_with_env(
      env, determinism_t{}, reduce_callable, d_in, d_out, num_items, reduction_op, ::cuda::std::identity{}, init);
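
To make the shape of this interface easier to discuss, here is a minimal host-only sketch of the pattern. It is not the code in this PR: `toy_env`, `default_tuning_t`, `toy_reduce_impl`, `toy_reduce`, and `dispatch_with_env_sketch` are hypothetical stand-ins, and the two-phase "query size, then run" handling of `storage`/`bytes` is an assumption about what the helper does with temporary storage. The determinism check stays in the algorithm entry point rather than in the shared helper, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for the determinism tags and a tuning policy.
struct run_to_run_t {};
struct gpu_to_gpu_t {};
struct default_tuning_t {};

// Hypothetical environment: the real one would carry a stream, a memory resource, etc.
struct toy_env
{
  using tuning_t = default_tuning_t;
  int stream_id  = 0; // stands in for a CUDA stream
};

// Shared helper (sketch): pulls the tuning type out of the environment, then
// calls the algorithm-specific callable twice. Assumption: a null storage
// pointer means "only report the required size in bytes".
template <class Env, class Determinism, class Callable, class... Args>
int dispatch_with_env_sketch(const Env& env, Determinism, Callable callable, Args... args)
{
  using tuning_t = typename Env::tuning_t;
  (void) env.stream_id; // a real helper would launch on this stream

  std::size_t bytes = 0;
  int status        = callable(tuning_t{}, static_cast<void*>(nullptr), bytes, args...);
  if (status != 0)
  {
    return status;
  }

  std::vector<unsigned char> storage(bytes); // stands in for an allocation from the memory resource
  return callable(tuning_t{}, static_cast<void*>(storage.data()), bytes, args...);
}

// Algorithm-specific implementation (sketch): a sequential sum on the host.
template <class Tuning>
int toy_reduce_impl(void* storage, std::size_t& bytes, const int* in, int* out, int n)
{
  if (storage == nullptr)
  {
    bytes = sizeof(int); // pretend one int of scratch space is needed
    return 0;
  }
  int acc = 0;
  for (int i = 0; i < n; ++i)
  {
    acc += in[i];
  }
  *out = acc;
  return 0;
}

// Algorithm entry point: the determinism check lives here, per algorithm.
template <class Determinism>
int toy_reduce(const toy_env& env, Determinism det, const int* in, int* out, int n)
{
  static_assert(std::is_same<Determinism, run_to_run_t>::value || std::is_same<Determinism, gpu_to_gpu_t>::value,
                "toy_reduce: unsupported determinism");

  auto reduce_callable = [](auto tuning, void* storage, std::size_t& bytes, auto... args) {
    using tuning_t = decltype(tuning);
    return toy_reduce_impl<tuning_t>(storage, bytes, args...);
  };
  return dispatch_with_env_sketch(env, det, reduce_callable, in, out, n);
}
```

In this sketch, an algorithm that only supports run_to_run would narrow its own `static_assert` and leave the helper untouched, which is the main reason the determinism argument is passed through rather than interpreted by the shared code.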

I would like feedback on whether the interface of dispatch_with_env() looks sane.

Nov 13 '25 19:11 gonidelis

😬 CI Workflow Results

🟥 Finished in 1h 00m: Pass: 25%/81 | Total: 1d 05h | Max: 59m 55s | Hits: 75%/14741

See results here.

Nov 13 '25 20:11 github-actions[bot]

😬 CI Workflow Results

🟥 Finished in 3h 00m: Pass: 28%/81 | Total: 2d 04h | Max: 2h 59m | Hits: 81%/22346

See results here.

Nov 13 '25 23:11 github-actions[bot]