Replace `rmm::device_scalar` with `cudf::detail::device_scalar` to avoid unnecessary synchronization (Part 3 of mis-sync)
Description
For issue #18967, this PR is one part of splitting up the draft PR #18968. It replaces almost all `rmm::device_scalar` uses in libcudf with `cudf::detail::device_scalar`, which avoids the extra synchronization through its internal host-pinned bounce buffer.
This is also a call to action to use host-pinned memory more broadly in libcudf, following the arguments laid out in #18967 and #18968.
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
I also updated DEVELOPER_GUIDE.md to promote the use of cudf::detail::device_scalar. We can discuss expanding this section further. I recommend replacing all instances of rmm::device_scalar with cudf::detail::device_scalar across libcudf as soon as feasible.
So, when is such a replacement not feasible?
The only instance of `rmm::device_scalar` remaining outside `cpp/include/cudf/detail/device_scalar.hpp` is here:
https://github.com/rapidsai/cudf/blob/44ee9e5d449fdebd5afb8c32dff118460d8fd968/cpp/src/groupby/hash/compute_aggregations.cuh#L108-L126
Replacing it is not feasible because `cuda::std::atomic_flag` has no copy constructor (as also noted in a code comment there), and `cudf::detail::device_scalar`'s bounce buffer requires one.
The build error when this replacement is attempted:
/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(306): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
: exemplar(x)
^
detected during:
instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const T &) [with T=cuda::std::__4::atomic_flag]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::uninitialized_fill_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 86 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
instantiation of "thrust::detail::disable_if<thrust::detail::allocator_traits_detail::needs_default_construct_via_allocator<Allocator, thrust::detail::pointer_element<Pointer>::type>::value, void>::type thrust::detail::allocator_traits_detail::value_initialize_range(Allocator &, Pointer, Size) [with Allocator=cudf::detail::rmm_host_allocator<cuda::std::__4::atomic_flag>, Pointer=cuda::std::__4::atomic_flag *, Size=std::size_t]" at line 94 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
[ 5 instantiation contexts not shown ]
instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu
/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(306): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
: exemplar(x)
^
detected during:
instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const T &) [with T=cuda::std::__4::atomic_flag]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::uninitialized_fill_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 86 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
instantiation of "thrust::detail::disable_if<thrust::detail::allocator_traits_detail::needs_default_construct_via_allocator<Allocator, thrust::detail::pointer_element<Pointer>::type>::value, void>::type thrust::detail::allocator_traits_detail::value_initialize_range(Allocator &, Pointer, Size) [with Allocator=cudf::detail::rmm_host_allocator<cuda::std::__4::atomic_flag>, Pointer=cuda::std::__4::atomic_flag *, Size=std::size_t]" at line 94 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/allocator/value_initialize_range.inl
[ 5 instantiation contexts not shown ]
instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::nullable_global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu
/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(311): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
: exemplar(other.exemplar)
^
detected during:
instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const thrust::detail::uninitialized_fill_functor<T> &) [with T=cuda::std::__4::atomic_flag]" at line 64 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
[ 7 instantiation contexts not shown ]
instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu
/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(311): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
: exemplar(other.exemplar)
^
detected during:
instantiation of "thrust::detail::uninitialized_fill_functor<T>::uninitialized_fill_functor(const thrust::detail::uninitialized_fill_functor<T> &) [with T=cuda::std::__4::atomic_flag]" at line 64 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 55 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/uninitialized_fill.inl
[ 7 instantiation contexts not shown ]
instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::nullable_global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu
/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(320): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
::new (static_cast<void*>(&x)) T(exemplar);
^
detected during:
instantiation of "void thrust::detail::uninitialized_fill_functor<T>::operator()(T &) [with T=cuda::std::__4::atomic_flag]" at line 44 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/function.h
instantiation of "Result thrust::detail::wrapped_function<Function, Result>::operator()(Ts &&...) const [with Function=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>, Result=void, Ts=<cuda::std::__4::atomic_flag &>]" at line 70 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
[ 8 instantiation contexts not shown ]
instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu
/home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/internal_functional.h(320): error: function "cuda::std::__4::atomic_flag::atomic_flag(const cuda::std::__4::atomic_flag &)" (declared at line 640 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/libcudacxx/../../../libcudacxx/include/cuda/std/atomic) cannot be referenced -- it is a deleted function
::new (static_cast<void*>(&x)) T(exemplar);
^
detected during:
instantiation of "void thrust::detail::uninitialized_fill_functor<T>::operator()(T &) [with T=cuda::std::__4::atomic_flag]" at line 44 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/function.h
instantiation of "Result thrust::detail::wrapped_function<Function, Result>::operator()(Ts &&...) const [with Function=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>, Result=void, Ts=<cuda::std::__4::atomic_flag &>]" at line 70 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/sequential/for_each.h
instantiation of "InputIterator thrust::system::detail::sequential::for_each_n(thrust::system::detail::sequential::execution_policy<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 66 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/detail/for_each.inl
instantiation of "InputIterator thrust::for_each_n(const thrust::detail::execution_policy_base<DerivedPolicy> &, InputIterator, Size, UnaryFunction) [with DerivedPolicy=thrust::system::cpp::detail::tag, InputIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, UnaryFunction=thrust::detail::uninitialized_fill_functor<cuda::std::__4::atomic_flag>]" at line 89 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
instantiation of "ForwardIterator thrust::system::detail::generic::detail::uninitialized_fill_n(thrust::execution_policy<DerivedPolicy> &, ForwardIterator, Size, const T &, thrust::detail::false_type) [with DerivedPolicy=thrust::system::cpp::detail::tag, ForwardIterator=cuda::std::__4::atomic_flag *, Size=std::size_t, T=cuda::std::__4::atomic_flag]" at line 115 of /home/jluo/cudf-dev/cpp/build/_deps/cccl-src/lib/cmake/thrust/../../../thrust/thrust/system/detail/generic/uninitialized_fill.inl
[ 8 instantiation contexts not shown ]
instantiation of "cudf::detail::host_vector<T>::host_vector(size_t, const cudf::detail::rmm_host_allocator<T> &) [with T=cuda::std::__4::atomic_flag]" at line 486 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector_async<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 502 of /home/jluo/cudf-dev/cpp/include/cudf/detail/utilities/vector_factories.hpp
instantiation of "cudf::detail::host_vector<T> cudf::detail::make_pinned_vector<T>(size_t, rmm::cuda_stream_view) [with T=cuda::std::__4::atomic_flag]" at line 57 of /home/jluo/cudf-dev/cpp/include/cudf/detail/device_scalar.hpp
instantiation of "cudf::detail::device_scalar<T>::device_scalar(rmm::cuda_stream_view, rmm::device_async_resource_ref) [with T=cuda::std::__4::atomic_flag]" at line 108 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cuh
instantiation of "rmm::device_uvector<cudf::size_type> cudf::groupby::detail::hash::compute_aggregations(int64_t, __nv_bool, const cudf::bitmask_type *, SetType &, cudf::host_span<const cudf::groupby::aggregation_request, 18446744073709551615UL>, cudf::detail::result_cache *, rmm::cuda_stream_view) [with SetType=cudf::groupby::detail::hash::nullable_global_set_t]" at line 21 of /home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu
3 errors detected in the compilation of "/home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations.cu".
gmake[2]: *** [CMakeFiles/cudf.dir/build.make:1296: CMakeFiles/cudf.dir/src/groupby/hash/compute_aggregations.cu.o] Error 1
gmake[2]: *** Waiting for unfinished jobs....
3 errors detected in the compilation of "/home/jluo/cudf-dev/cpp/src/groupby/hash/compute_aggregations_null.cu".
gmake[2]: *** [CMakeFiles/cudf.dir/build.make:1311: CMakeFiles/cudf.dir/src/groupby/hash/compute_aggregations_null.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:1353: CMakeFiles/cudf.dir/all] Error 2
gmake: *** [Makefile:166: all] Error 2
Non-Parquet-related Changes
Changes in the following files do not impact Parquet:
- cpp/include/cudf_test/nanoarrow_utils.hpp
- cpp/src/copying/get_element.cu
- cpp/src/join/sort_merge_join.cu
- and the two test files

Replacements were still made in these files for consistency with the DEVELOPER_GUIDE.md update; a global replacement makes more sense to me and better fits the updated guide. We can also discuss this. Thanks!
/ok to test 04e5415
Thank you for the PR! Some of the changes need to be reverted, the rest looks good 👍 Do you know which of these caused pageable copies in the Parquet reader?
Hi @vuule ,
In the vanilla Parquet reader, one direct mis-sync is caused in cpp/include/cudf/detail/sizes_to_offsets_iterator.cuh and is addressed in my commit 3fab90c467a6c6a43523e24ec76e2a46f86f1afc (as well as in draft PR #18968). No other replacement directly causes a mis-sync.
But why change scalar.hpp?
As you’re aware, I’ve replaced Thrust with extensive reuse of cudf::reduction::detail::reduce. These changes are bundled into this patch because an in-place modification in cpp/include/cudf/reduction/detail/reduction.cuh isn’t feasible on its own; it necessitated updates to cpp/include/cudf/scalar/scalar.hpp. With only the in-place modification in reduction.cuh, the code compiles but fails at link time with undefined symbol errors: https://github.com/rapidsai/cudf/pull/18968#issuecomment-2910707766
I have an in-place modification for reproduction at this separate branch: https://github.com/JigaoLuo/cudf/tree/inplace-change-but-symbolerror-timestampscalar
The linking error looks like this:
/home/jluo/cudf-dev/cpp/build/gtests/STREAM_UNARY_TEST: symbol lookup error: /home/jluo/miniconda3/envs/cudf_dev/lib/libcudf.so: undefined symbol: _ZN4cudf16timestamp_scalarIN4cuda3std3__46chrono10time_pointINS4_12system_clockENS4_8durationIlNS3_5ratioILl1ELl1000EEEEEEEEC1INS_6detail13device_scalarISB_EEEERKT_bN3rmm16cuda_stream_viewENS1_2mr3__418basic_resource_refILNSN_10_AllocTypeE1EJNSN_17device_accessibleEEEE
I’m unsure whether this is an expected error: a simple change in the reduction code triggers issues with timestamp_scalar, which seems unrelated. To resolve the linking error, I modified these files together in commit 7f0785a3ce2f959c91f5f2e3eab26546aec578b8 of this PR. While not ideal, I couldn’t identify a cleaner fix given the complexity. Let me know if you need further clarification!
The full output running unit tests (just repeating the same error): LastTest.log
Addressing https://github.com/rapidsai/rmm/issues/1955 would help improve this in rmm directly (we shouldn't block this PR on making that change though; it's a long-term suggestion for improvement).
This is addressed in my recent commits. The fix is to add overloads for cudf::detail::device_scalar instead of replacing it in the header.
We’ve decided to continue the work there and have reopened the PR as a draft.
@vuule Thanks for reverting and triggering the tests. The same undefined symbol error from chrono_scalar persists, both in CI and on my local setup.
(I linked the failing CI test here: https://github.com/rapidsai/cudf/actions/runs/16688029938/job/47242262880?pr=19119)
/ok to test 9e7c8db
@vuule @davidwendt Thanks for helping!
I see the only remaining loose end is with `rmm::device_scalar`, but I also noticed there's a draft PR #19608 addressing that. I'll just leave a note in the code review for logging purposes; feel free to ignore it.
Thanks, I’ve read through it again and I think it’s perfect for merging from my perspective.
One thing I had hoped to do about three months ago was to add a small piece of documentation in the markdown: https://github.com/rapidsai/cudf/pull/19119/commits/06ae1d2cdcede2ab6332da6b22bef88648301ebd#diff-6cc9f4135444fd2ed9062e7168a007f382c2c855fccc18e859834a5d1a6de278R655
- I'm not sure how feasible it is to include that now. Should we try to add it, or just leave it as is?
@JigaoLuo Can you test/profile this change with your code? I think the mis-sync is likely still here.
I believe the scalar classes would need to change their internal _data type to cudf::detail::device_scalar as well.
https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L202
https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L415
Perhaps the mis-sync is only with the validity value then? https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L101
> One thing I had hoped to do about three months ago was to add a small piece of documentation in the markdown: 06ae1d2#diff-6cc9f4135444fd2ed9062e7168a007f382c2c855fccc18e859834a5d1a6de278R655
> - I'm not sure how feasible it is to include that now. Should we try to add it, or just leave it as is?
Feel free to include this, I don't see why not.
> @JigaoLuo Can you test/profile this change with your code? I think the mis-sync is likely still here.
> I believe the scalar classes would need to change their internal `_data` type to `cudf::detail::device_scalar` as well.
> https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L202
> https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L415
> Perhaps the mis-sync is only with the validity value then?
> https://github.com/rapidsai/cudf/blob/28613267714446c29a28183522fea86db24cae1c/cpp/include/cudf/scalar/scalar.hpp#L101
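For background on why the host-pinned bounce buffer matters for these reads, here is a hedged CUDA sketch of the two read paths. It models the idea only; the simplified copies below are not the actual `rmm::device_scalar` or `cudf::detail::device_scalar` implementations.

```cuda
#include <cuda_runtime.h>
#include <cassert>

int main()
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  int* d_val;
  cudaMalloc(&d_val, sizeof(int));
  int init = 42;
  cudaMemcpy(d_val, &init, sizeof(int), cudaMemcpyHostToDevice);

  // Pageable path (roughly what reading through rmm::device_scalar does):
  // the "async" copy targets a pageable stack variable, so the driver
  // stages it and the call behaves synchronously with respect to the host.
  int pageable_out = 0;
  cudaMemcpyAsync(&pageable_out, d_val, sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // Pinned bounce-buffer path (the idea behind cudf::detail::device_scalar):
  // the destination is host-pinned, so the copy stays truly asynchronous and
  // only the explicit stream synchronization below blocks the calling thread.
  int* pinned_out;
  cudaMallocHost(reinterpret_cast<void**>(&pinned_out), sizeof(int));
  cudaMemcpyAsync(pinned_out, d_val, sizeof(int), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  assert(pageable_out == 42 && *pinned_out == 42);

  cudaFreeHost(pinned_out);
  cudaFree(d_val);
  cudaStreamDestroy(stream);
  return 0;
}
```

This is why a scalar whose `_data` (or validity flag) is still read through pageable host memory would keep showing up in the `cuda_memcpy_async` mis-sync counts.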
I can’t fully recall (sorry) why I missed the other rmm::device_scalar cases. It’s possible I simply overlooked them.
> I believe the scalar classes would need to change their internal `_data` type to `cudf::detail::device_scalar` as well.
- I completely agree with your point; I may have just forgotten to fix those.
- I'm considering replacing them with cudf's `device_scalar`, which aligns with the intention of this PR. Do we need consensus on this before proceeding?
Regarding the test/profiling, I’m not entirely sure what’s being proposed or how I should proceed with testing and profiling.
- In the original patch, my goal was to identify all instances of pageable memory within the Parquet reading path.
- The remaining `rmm::device_scalar` loose ends might be used in other, non-Parquet functions that still involve pageable memcpy. That's why I'm uncertain how to design meaningful tests or profiling for those cases.
> - I'm considering replacing them with cudf's `device_scalar`, which aligns with the intention of this PR. Do we need consensus on this before proceeding?
I don't think that is necessary for this PR to proceed.
> Regarding the test/profiling, I'm not entirely sure what's being proposed or how I should proceed with testing and profiling.
> - In the original patch, my goal was to identify all instances of pageable memory within the Parquet reading path.
I had assumed you had found the mis-sync areas in the parquet reading path and verified this change fixed them. I was just suggesting re-doing that analysis based on the now current changes. If that does not make sense then I apologize and you can ignore this.
> If that does not make sense then I apologize and you can ignore this.
No worries at all!
Let me briefly summarize what I’ve done: I identified the mis-synchronization areas in the Parquet reader and addressed them all. These fixes are all collected in the draft PR #18968 to demonstrate the overall improvement. The current PR is just a small subset of that draft and it only resolves part of the mis-sync issues.
Regarding profiling: I can run the same command I used earlier in https://github.com/rapidsai/cudf/pull/18968#issuecomment-2919937779 to show a simple count of mis-sync occurrences before and after this patch. It’s a quick task, and I think it’s worth doing.
This is just an example to count cuda_memcpy_async with pageable memory:
```shell
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
166861
```
@vuule Would you please describe the next steps before we are ready to merge?
> @vuule Would you please describe the next steps before we are ready to merge?
I think the code changes can be merged as they are now, I'm just waiting for @JigaoLuo to add the change described here to proceed with the (final) reviews.
> I think the code changes can be merged as they are now
I think this should better wait until https://github.com/rapidsai/cudf/pull/19608 merges in, then rebase before merging.
I’ve reverted the previous changes in DEVELOPER_GUIDE.md.
Additionally, I ran a similar profiling to https://github.com/rapidsai/cudf/pull/18968#issuecomment-2919937779 and observed a slight drop in the number of mis-sync:
The command:
```shell
for t in 1 2 4 8 16 32 64 128; do
  ./PARQUET_MULTITHREAD_READER_NVBENCH -d 0 -b 0 --axis num_cols=32 --axis run_length=2 --axis total_data_size=$((1024 * 1024 * 128 * t)) --axis num_threads=$t --axis num_iterations=10 --csv <PATH>
done
# then, with the nsys profile:
nsys export -t sqlite report{thread-case}.nsys-rep
nsys analyze -r cuda_memcpy_async:rows=-1 report{thread-case}.sqlite | wc -l
```
The current branch 25.10, before this PR:
```shell
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_upstream_8threads.sqlite | wc -l
63181
```
With this PR:
```shell
$ nsys analyze -r cuda_memcpy_async:rows=-1 report_PR_8threads.sqlite | wc -l
63021
```
The target number we're aiming for is the one mentioned in the draft PR comment: ~301, based on the same benchmark binary run.
I don’t recall the exact impact of each individual change on the reduction of mis-syncs, but from what I observed, the Thrust reduction seemed to trigger the most mis-synchronizations. There are also many calls to the Thrust reduction.
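To illustrate that last point: a plain `thrust::reduce` must return its result by value to the host, which implies a device-to-host copy into pageable stack memory and a synchronization before the call returns. A hedged sketch of the pattern, not cudf's actual code:

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>
#include <cassert>

int main()
{
  thrust::device_vector<int> v(1000, 1);

  // Synchronous by construction: the int result has to land in (pageable)
  // host memory before thrust::reduce can return it by value, so every
  // such call shows up as a pageable cuda_memcpy_async in the profile.
  int sum = thrust::reduce(thrust::device, v.begin(), v.end(), 0);
  assert(sum == 1000);

  // The alternative pursued in this PR's lineage: keep the result in device
  // memory (e.g. via cudf::reduction::detail::reduce) and read it back later
  // through a host-pinned bounce buffer on a chosen stream.
  return 0;
}
```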
Apologies for the delay; I wasn't able to find time over the past two days. I also have a major paper deadline coming up, so I want to apologize in advance for any slow responses.
I've now turned this draft into a regular PR, but let's wait for #19608.
Ok, #19608 is merged. I think this will be good to go once the merge conflict here is resolved.
Thanks for the notification. I’ve resolved the conflict. Before merging, I’d also like to request a code review for the changes made in DEVELOPER_GUIDE.md.
/ok to test 6708c96
Thanks again for your help; really appreciate it!
Also, big thanks to everyone here, including the RMM folks.
/ok to test 5fc341155d60628456b5d6fc5507276bb9a3412a
/merge