
Replace `rmm::device_scalar` with `cudf::detail::device_scalar` due to unnecessary synchronization (Part 3 of miss-sync)

Open JigaoLuo opened this issue 6 months ago • 5 comments

Description

As part of issue #18967, this PR merges one piece of draft PR #18968: it replaces almost all `rmm::device_scalar` uses in libcudf with `cudf::detail::device_scalar`, whose internal host-pinned bounce buffer avoids the unnecessary synchronization caused by copies to pageable host memory.

This is also a call to action to use host-pinned memory globally in libcudf, with arguments stated in #18967 and #18968.
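To illustrate the bounce-buffer idea mentioned above, here is a host-only sketch. All names and the structure are hypothetical, not the actual cudf implementation; in the real `cudf::detail::device_scalar` the staging slot lives in host-pinned memory and the second copy is a `cudaMemcpyAsync` on the caller's stream, with plain host variables standing in for both here.

```cpp
#include <vector>

// Hypothetical, host-only mock of the bounce-buffer pattern.
template <typename T>
class pinned_bounce_scalar {
  std::vector<T> bounce_;  // stands in for a host-pinned vector of size 1
  T device_;               // stands in for the device allocation
 public:
  explicit pinned_bounce_scalar(T const& v) : bounce_(1, v), device_{bounce_[0]} {}
  // Host writes go through the staging slot first, so the host-to-device
  // copy never touches pageable memory and can remain asynchronous.
  void set_value(T const& v) {
    bounce_[0] = v;        // host write into the pinned staging slot
    device_ = bounce_[0];  // cudaMemcpyAsync(pinned -> device) in the real code
  }
  T value() const { return device_; }  // device -> pinned -> host in the real code
};
```

Note that the staging vector is fill-constructed from the initial value, which is why the payload type must be copy-constructible.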

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

JigaoLuo avatar Jun 08 '25 13:06 JigaoLuo

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot[bot] avatar Jun 08 '25 13:06 copy-pr-bot[bot]

I also updated DEVELOPER_GUIDE.md to promote the use of cudf::detail::device_scalar. We can discuss expanding this section further. I recommend replacing all instances of rmm::device_scalar with cudf::detail::device_scalar across libcudf as soon as feasible.

So, when is such a replacement not feasible?

Only one instance of rmm::device_scalar remains outside cpp/include/cudf/detail/device_scalar.hpp:

https://github.com/rapidsai/cudf/blob/44ee9e5d449fdebd5afb8c32dff118460d8fd968/cpp/src/groupby/hash/compute_aggregations.cuh#L108-L126

Replacing it is not feasible because cuda::std::atomic_flag has a deleted copy constructor (also noted in a code comment), while cudf::detail::device_scalar's host-pinned bounce buffer requires a copy-constructible type.
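The constraint can be checked on the host with `std::atomic_flag`, which, like `cuda::std::atomic_flag`, deletes its copy constructor:

```cpp
#include <atomic>
#include <type_traits>

// thrust::uninitialized_fill_n copy-constructs every element of the pinned
// bounce buffer from an exemplar value, so the payload type must be
// copy-constructible. std::atomic_flag (like cuda::std::atomic_flag)
// deletes that constructor, which is exactly what the compiler reports below.
static_assert(!std::is_copy_constructible_v<std::atomic_flag>,
              "atomic_flag cannot be fill-constructed into a bounce buffer");
static_assert(std::is_copy_constructible_v<int>,
              "plain trivially copyable payloads are fine");
```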

The error log when this replacement is attempted:
/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(306): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
        : exemplar(x)
                  ^
          detected during:
            instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const T &) [with T=cuda::std::__4::atomic_flag]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::uninitialized_fill_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 86 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
            instantiation of "thrust::detail::disable_if<thrust::detail::allocator_traits_detail::needs_default_construct_via_allocator<Allocator, thrust::detail::pointer_element<Pointer>::type>::value, void>::type thrust::detail::allocator_traits_detail::value_initialize_range(Allocator &, Pointer, Size) [with Allocator=cudf::detail::rmm_host_allocator<cuda::std::__4::atomic_flag>, Pointer=cuda::std::__4::atomic_flag *, Size=std::size_t]" at line 94 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
            [ 5 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(306): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
        : exemplar(x)
                  ^
          detected during:
            instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const T &) [with T=cuda::std::__4::atomic_flag]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::uninitialized_fill_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 86 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
            instantiation of "thrust::detail::disable_if<thrust::detail::allocator_traits_detail::needs_default_construct_via_allocator<Allocator, thrust::detail::pointer_element<Pointer>::type>::value, void>::type thrust::detail::allocator_traits_detail::value_initialize_range(Allocator &, Pointer, Size) [with Allocator=cudf::detail::rmm_host_allocator<cuda::std::__4::atomic_flag>, Pointer=cuda::std::__4::atomic_flag *, Size=std::size_t]" at line 94 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
            [ 5 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::nullable_global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(311): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
        : exemplar(other.exemplar)
                  ^
          detected during:
            instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const thrust::detail::uninitialized_fill_functor<T> &) [with T=cuda::std::__4::atomic_flag]" at line 64 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
            instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
            instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
            [ 7 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(311): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
        : exemplar(other.exemplar)
                  ^
          detected during:
            instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const thrust::detail::uninitialized_fill_functor<T> &) [with T=cuda::std::__4::atomic_flag]" at line 64 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
            instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
            instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
            [ 7 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::nullable_global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(320): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
      ::new (static_cast<void*>(&x)) T(exemplar);
                                      ^
          detected during:
            instantiation of "void thrust::detail::uninitialized_fill_functor<T>::operator()(T &) [with T=cuda::std::__4::atomic_flag]" at line 44 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/function.h
            instantiation of "Result thrust::detail::wrapped_function<Function, Result>::operator()(Ts &&...) const [with Function=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>, Result=void, Ts=<cuda::std::__4::atomic_flag &>]" at line 70 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
            instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
            instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            [ 8 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu

/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(320): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
      ::new (static_cast<void*>(&x)) T(exemplar);
                                      ^
          detected during:
            instantiation of "void thrust::detail::uninitialized_fill_functor<T>::operator()(T &) [with T=cuda::std::__4::atomic_flag]" at line 44 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/function.h
            instantiation of "Result thrust::detail::wrapped_function<Function, Result>::operator()(Ts &&...) const [with Function=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>, Result=void, Ts=<cuda::std::__4::atomic_flag &>]" at line 70 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
            instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
            instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
            [ 8 instantiation contexts not shown ]
            instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
            instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
            instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
            instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::nullable_global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu

3 errors detected in the compilation of "/home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu".
gmake[2]: *** [CMakeFiles/cudf.dir/build.make:1296: CMakeFiles/cudf.dir/src/groupby/hash/compute_aggregations.cu.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
3 errors detected in the compilation of "/home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu".
gmake[2]: *** [CMakeFiles/cudf.dir/build.make:1311: CMakeFiles/cudf.dir/src/groupby/hash/compute_aggregations_null.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:1353: CMakeFiles/cudf.dir/all] Error 2
gmake: *** [Makefile:166: all] Error 2

Non-Parquet-related Changes

Changes in the following files do not impact Parquet:

  • cpp/include/cudf_test/nanoarrow_utils.hpp
  • cpp/src/copying/get_element.cu
  • cpp/src/join/sort_merge_join.cu
  • and the two test files

Replacements were still made in these files for consistency: a global replacement better matches the updated DEVELOPER_GUIDE.md. Happy to discuss. Thanks!

JigaoLuo avatar Jun 08 '25 13:06 JigaoLuo

/ok to test 04e5415

vuule avatar Jun 09 '25 19:06 vuule

Thank you for the PR! Some of the changes need to be reverted, the rest looks good 👍 Do you know which of these caused pageable copies in the Parquet reader?

Hi @vuule ,

In the vanilla Parquet reader, one direct miss-sync originates in cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh and is addressed in my commit 3fab90c467a6c6a43523e24ec76e2a46f86f1afc (as well as in draft PR #18968). No other replacement directly causes a miss-sync.

 

But why change scalar.hpp?

As you’re aware, I’ve replaced Thrust with extensive reuse of cudf::reduction::detail::reduce. These changes are bundled into this patch because an isolated in-place modification in cpp/include/cudf/reduction/detail/reduction.cuh isn’t feasible: it also necessitated updates to cpp/include/cudf/scalar/scalar.hpp. With only the in-place modification in reduction.cuh, the code compiles but fails at link time with undefined-symbol errors: https://github.com/rapidsai/cudf/pull/18968#issuecomment-2910707766

I have an in-place modification for reproduction at this separate branch: https://github.com/JigaoLuo/cudf/tree/inplace-change-but-symbolerror-timestampscalar

The linking error looks like this:

/home/jluo/cudf-dev/cpp/build/gtests/STREAM_UNARY_TEST: symbol lookup error: /home/jluo/miniconda3/envs/cudf_dev/lib/libcudf.so: undefined symbol: _ZN4cudf16timestamp_scalarIN4cuda3std3__46chrono10time_pointINS4_12system_clockENS4_8durationIlNS3_5ratioILl1ELl1000EEEEEEEEC1INS_6detail13device_scalarISB_EEEERKT_bN3rmm16cuda_stream_viewENS1_2mr3__418basic_resource_refILNSN_10_AllocTypeE1EJNSN_17device_accessibleEEEE
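If binutils is available, the mangled name can be decoded with c++filt. The demangled form shows a cudf::timestamp_scalar constructor template instantiated with cudf::detail::device_scalar, which is why a change in reduction surfaces as a timestamp_scalar symbol:

```shell
# Decode the undefined symbol (requires c++filt from binutils).
echo '_ZN4cudf16timestamp_scalarIN4cuda3std3__46chrono10time_pointINS4_12system_clockENS4_8durationIlNS3_5ratioILl1ELl1000EEEEEEEEC1INS_6detail13device_scalarISB_EEEERKT_bN3rmm16cuda_stream_viewENS1_2mr3__418basic_resource_refILNSN_10_AllocTypeE1EJNSN_17device_accessibleEEEE' | c++filt
```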

I’m unsure whether this is an expected error: a simple change in reduction triggers issues with timestamp_scalar, which seems unrelated. To resolve this linking error, I modified these files together in commit 7f0785a3ce2f959c91f5f2e3eab26546aec578b8 of this PR. While not ideal, I couldn’t identify a cleaner fix given the complexity. Let me know if you need further clarification!

 

The full output from running the unit tests (just the same error repeated): LastTest.log

JigaoLuo avatar Jun 10 '25 08:06 JigaoLuo

Addressing https://github.com/rapidsai/rmm/issues/1955 would help improve this situation in rmm directly (we shouldn't block this PR on making that change though; it's a long-term suggestion for improvement).

vyasr avatar Jun 11 '25 17:06 vyasr

Thank you for the PR! Some of the changes need to be reverted, the rest looks good 👍 Do you know which of these caused pageable copies in the Parquet reader?

Hi @vuule ,

In the Vanilla Parquet reader, one direct miss-sync is caused in cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh and is addressed in my commit 3fab90c (as well as in Draft PR #18968.) Then no other will directly cause a miss-sync.

But why changing scalar.hpp

As you’re aware, I’ve replaced Thrust with extensive reuse of cudf::reduction::detail::reduce. These changes are bundled into this patch, as an in-place modification in cpp/include/cudf/reduction/detail/reduction.cuh isn’t feasible. This necessitated updates to cpp/include/cudf/scalar/scalar.hpp. If only in-place modification in reduction.cuh, the code compiles but fails at linking with undefined symbol errors: #18968 (comment)

I have an in-place modification for reproduction at this separate branch: https://github.com/JigaoLuo/cudf/tree/inplace-change-but-symbolerror-timestampscalar

The linking error looks like this:

```
/home/jluo/cudf-dev/cpp/build/gtests/STREAM_UNARY_TEST: symbol lookup error: /home/jluo/miniconda3/envs/cudf_dev/lib/libcudf.so: undefined symbol: _ZN4cudf16timestamp_scalarIN4cuda3std3__46chrono10time_pointINS4_12system_clockENS4_8durationIlNS3_5ratioILl1ELl1000EEEEEEEEC1INS_6detail13device_scalarISB_EEEERKT_bN3rmm16cuda_stream_viewENS1_2mr3__418basic_resource_refILNSN_10_AllocTypeE1EJNSN_17device_accessibleEEEE
```

I'm unsure whether this is an expected error: a simple change in the reduction code triggers issues with timestamp_scalar, which seems unrelated. To resolve the linking error, I modified these files together in commit 7f0785a of this PR. While not ideal, I couldn't identify a cleaner fix given the complexity. Let me know if you need further clarification!

The full output from running the unit tests (just the same error repeated): LastTest.log

This is addressed in my recent commits. The fix is to add overloads for cudf::detail::device_scalar instead of replacing the type in the header.

JigaoLuo avatar Jun 17 '25 13:06 JigaoLuo

We’ve decided to continue the work there and have reopened the PR as a draft.

JigaoLuo avatar Aug 01 '25 09:08 JigaoLuo

@vuule Thanks for reverting and triggering the tests. The same undefined-symbol error from chrono_scalar persists, both in CI and on my local setup.

(I just cached the failing CI test here: https://github.com/rapidsai/cudf/actions/runs/16688029938/job/47242262880?pr=19119)

JigaoLuo avatar Aug 02 '25 14:08 JigaoLuo

/ok to test 9e7c8db

vuule avatar Aug 07 '25 02:08 vuule

@vuule @davidwendt Thanks for helping! I see the only remaining loose end is with rmm::device_scalar, but I also noticed there's a draft PR #19608 addressing that. I'll just leave a note in the code review for logging purposes; feel free to ignore it.

JigaoLuo avatar Aug 07 '25 08:08 JigaoLuo

Thanks, I've read through it again, and from my perspective it's ready to merge.

One thing I had hoped to do about three months ago was to add a small piece of documentation to the markdown: https://github.com/rapidsai/cudf/pull/19119/commits/06ae1d2cdcede2ab6332da6b22bef88648301ebd#diff-6cc9f4135444fd2ed9062e7168a007f382c2c855fccc18e859834a5d1a6de278R655

  • I'm not sure how feasible it is to include that now. Should we try to add it, or just leave it as is?

JigaoLuo avatar Aug 07 '25 08:08 JigaoLuo

@JigaoLuo Can you test/profile this change with your code? I think the mis-sync is likely still here.

I believe the scalar classes would need to change their internal _data type to cudf::detail::device_scalar as well.

https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L202

https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L415

Perhaps the mis-sync is only with the validity value then? https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L101

davidwendt avatar Aug 07 '25 13:08 davidwendt

One thing I had hoped to do about three months ago was to add a small piece of documentation to the markdown: 06ae1d2#diff-6cc9f4135444fd2ed9062e7168a007f382c2c855fccc18e859834a5d1a6de278R655

  • I'm not sure how feasible it is to include that now. Should we try to add it, or just leave it as is?

Feel free to include this, I don't see why not.

vuule avatar Aug 07 '25 18:08 vuule

@JigaoLuo Can you test/profile this change with your code? I think the mis-sync is likely still here.

I believe the scalar classes would need to change their internal _data type to cudf::detail::device_scalar as well.

https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L202

https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L415

Perhaps the mis-sync is only with the validity value then?

https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L101

I can’t fully recall (sorry) why I missed the other rmm::device_scalar cases. It’s possible I simply overlooked them.

I believe the scalar classes would need to change their internal _data type to cudf::detail::device_scalar as well.

  • I completely agree with your point; I may have just forgotten to fix those.
  • I'm considering replacing them with cudf's device_scalar, which aligns with the intention of this PR. Do we need consensus on this before proceeding?

Regarding the test/profiling, I'm not entirely sure what's being proposed or how I should proceed.

  • In the original patch, my goal was to identify all instances of pageable memory within the Parquet reading path.
  • The remaining loose ends in rmm::device_scalar might be used in other non-Parquet functions that still involve pageable memcpy. That’s why I’m uncertain about how to design meaningful tests or profiling for those cases.

JigaoLuo avatar Aug 07 '25 21:08 JigaoLuo

  • I'm considering replacing them with cudf's device_scalar, which aligns with the intention of this PR. Do we need consensus on this before proceeding?

I don't think that is necessary for this PR to proceed.

Regarding the test/profiling, I’m not entirely sure what’s being proposed or how I should proceed with testing and profiling.

  • In the original patch, my goal was to identify all instances of pageable memory within the Parquet reading path.

I had assumed you had found the mis-sync areas in the Parquet reading path and verified that this change fixed them. I was just suggesting redoing that analysis on the current changes. If that does not make sense, then I apologize and you can ignore this.

davidwendt avatar Aug 07 '25 22:08 davidwendt

If that does not make sense then I apologize and you can ignore this.

No worries at all!

Let me briefly summarize what I've done: I identified the mis-synchronization areas in the Parquet reader and addressed them all. Those fixes are collected in the draft PR #18968 to demonstrate the overall improvement. The current PR is just a small subset of that draft and resolves only part of the mis-sync issues.

Regarding profiling: I can run the same command I used earlier in https://github.com/rapidsai/cudf/pull/18968#issuecomment-2919937779 to show a simple count of mis-sync occurrences before and after this patch. It’s a quick task, and I think it’s worth doing.

This is just an example counting cuda_memcpy_async calls involving pageable memory:

```
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
166861
```

JigaoLuo avatar Aug 08 '25 08:08 JigaoLuo

@vuule Would you please describe the next steps before we are ready to merge?

GregoryKimball avatar Aug 08 '25 21:08 GregoryKimball

@vuule Would you please describe the next steps before we are ready to merge?

I think the code changes can be merged as they are now; I'm just waiting for @JigaoLuo to add the change described here before proceeding with the (final) reviews.

vuule avatar Aug 08 '25 21:08 vuule

I think the code changes can be merged as they are now

I think this had better wait until https://github.com/rapidsai/cudf/pull/19608 is merged, then rebase before merging.

ttnghia avatar Aug 08 '25 21:08 ttnghia

I’ve reverted the previous changes in DEVELOPER_GUIDE.md.

Additionally, I ran profiling similar to https://github.com/rapidsai/cudf/pull/18968#issuecomment-2919937779 and observed a slight drop in the number of mis-syncs:

The command:

```shell
for t in 1 2 4 8 16 32 64 128;
do
  ./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=$((1024 * 1024 * 128 * t)) --axis num_threads=$t --axis num_iterations=10 --csv <PATH>;
done

# then, with nsys profile:

nsys export -t sqlite report{thread-case}.nsys-rep

nsys analyze -r cuda_memcpy_async:rows=-1 report{thread-case}.sqlite | wc -l
```

On the current 25.10 branch, before this PR:

```
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
63181
```

With this PR:

```
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_PR_8threads.sqlite | wc -l
63021
```

JigaoLuo avatar Aug 09 '25 09:08 JigaoLuo

The target number we're aiming for is the one mentioned in the draft PR comment: ~301, based on the same benchmark binary run.

I don't recall the exact impact of each individual change on the mis-sync count, but from what I observed, the Thrust reductions seemed to trigger the most mis-synchronizations, and there are a lot of calls to them.

Apologies for the delay; I wasn't able to find time over the past two days. I also have a major paper deadline coming up, so I apologize in advance for any slow responses.

JigaoLuo avatar Aug 09 '25 09:08 JigaoLuo

I've now turned this draft into a regular PR. But let's wait for #19608.

JigaoLuo avatar Aug 09 '25 09:08 JigaoLuo

OK, #19608 is merged. I think this will be good to go once the merge conflict here is resolved.

davidwendt avatar Aug 11 '25 18:08 davidwendt

Thanks for the notification. I’ve resolved the conflict. Before merging, I’d also like to request a code review for the changes made in DEVELOPER_GUIDE.md.

JigaoLuo avatar Aug 11 '25 19:08 JigaoLuo

/ok to test 6708c96

vuule avatar Aug 12 '25 17:08 vuule

Thanks again for your help, I really appreciate it!

Also, big thanks to everyone here, including the RMM folks.

JigaoLuo avatar Aug 12 '25 18:08 JigaoLuo

/ok to test 5fc341155d60628456b5d6fc5507276bb9a3412a

davidwendt avatar Aug 12 '25 19:08 davidwendt

/merge

vuule avatar Aug 12 '25 22:08 vuule