cudf icon indicating copy to clipboard operation
cudf copied to clipboard

[BUG] libcudf must customize the Thrust/CUB namespace

Open jrhemstad opened this issue 2 years ago • 12 comments

Describe the bug

For header-only libraries like Thrust/CUB, it can be problematic when an application inadvertently ends up with multiple or mismatched versions of the headers.

Thrust/CUB provide a way to universally customize the namespace such that a symbol like thrust::reduce will instead be thrust::CUSTOM::reduce.

This prevents inadvertent symbol collisions and version mismatches by forcing libcudf to use Thrust/CUB symbols from within its custom namespace.

Solution

libcudf should set THRUST_CUB_WRAPPED_NAMESPACE=cudf as part of the cmake configuration step for Thrust/CUB.

jrhemstad avatar Jul 27 '22 14:07 jrhemstad

Yes, we should do this. It was on my list of things to investigate after I upgrade RAPIDS to Thrust 1.17. (Also I think it’s an outer namespace like CUSTOM::thrust::reduce.)

bdice avatar Jul 27 '22 23:07 bdice

I agree we need to find a way to ensure that two distinct libraries that have no interaction can safely use Thrust without accidently causing issues due to ABI changes.

The problem I see with proposed approach is that it makes using Thrust types harder when two projects call into each other and have Thrust types in the public API like all of RAPIDS.

For example here is a super simplified example from cudf:

 #include <rmm/exec_policy.hpp>

 #include <thrust/optional.h>
 using namespace cudf;

 namespace cudf {
 namespace detail {
  ...
 }
 }

This code fails to compile since the thrust types used in <rmm/exec_policy.hpp> don't exist ( they are cudf::thrust ) and therefore the code fails. Likewise moving the using namespace above the <rmm/exec_policy.hpp> doesn't work as the namespace hasn't been created yet.

This means that at a minimum the namespace that we import thrust into has to be consistent across all of RAPIDS. To make that robust our logic should look like:


 #include <thrust/optional.h>
 #ifdef THRUST_WRAPPED_NAMESPACE
   using namespace THRUST_WRAPPED_NAMESPACE;
 #endif

 namespace cudf {
  ...
 }

Now that we have solved internal RAPIDS dependencies we should explore the impacts for callers of libcudf

  • I believe that we should treat this as a full ABI break. I see that in a couple of locations use thrust types as data members or parameters which might cause a break.
  • While the above ifdef guards will ensure that cudf headers always match the thrust namespace convention of the caller, I believe that to use to libcudf API that has Thrust types exposed they will also need to bring Thrust in as THRUST_WRAPPED_NAMESPACE::thrust and add the above using logic.
  • Everyones needs to be aware that this won't stop runtime ODR / ABI breaks due to compiling with different Thrust versions. It would be entirely possible to build libcudf with Thrust 1.17 and build a direct libcudf consumer with Thrust 1.15 as they both support the THRUST_WRAPPED_NAMESPACE syntax.

This isn't to say I oppose this work, we just need to make sure we understand the impact. To help with this effort here are my current draft PR's for rmm and cudf:

  • https://github.com/rapidsai/rmm/pull/1077
  • https://github.com/rapidsai/cudf/pull/11440

robertmaynard avatar Aug 02 '22 19:08 robertmaynard

have Thrust types in the public API like all of RAPIDS.

We should just get away from doing this. I don't know of cases where we use anything other than thrust::optional in a public API, and that can be replaced with std::optional now that we use C++17.

Everyones needs to be aware that this won't stop runtime ODR / ABI breaks due to compiling with different Thrust versions.

I don't think we're ever concerned with ABI breaks in RAPIDS libraries.

What kind of ODR issues do you foresee? If two versions of Thrust are present, each in different namespaces, then they are different symbols so far as the program is concerned and therefore distinct definitions of the symbol won't be a problem.

jrhemstad avatar Aug 03 '22 01:08 jrhemstad

If I understand correctly, every RAPIDS library would have a different definition of rmm::device_vector because it’s reused straight from Thrust. Would those types be interchangeable among RAPIDS libraries using rmm headers if the Thrust namespaces (and thus symbols) are different among each RAPIDS library? (Apologies if this explanation is too short, I can provide a longer form explanation of what I mean later if this is unclear)

bdice avatar Aug 03 '22 01:08 bdice

We should just get away from doing this.

I agree. I will look at changing the offending code to not expose thrust

I don't think we're ever concerned with ABI breaks in RAPIDS libraries.

What kind of ODR issues do you foresee? If two versions of Thrust are present, each in different namespaces, then they are different symbols so far as the program is concerned and therefore distinct definitions of the symbol won't be a problem.

This doesn't stop two different versions of Thrust being compiled using the same namespace. So cudf and a consumer could use differing versions of thrust and place it in the same namespace. This might happen due to a leakage of the THRUST_WRAPPED_NAMESPACE value to the consumer.

Would those types be interchangeable among RAPIDS libraries using rmm headers if the Thrust namespaces (and thus symbols) are different among each RAPIDS library?

The won't be interchangable. We can see that when looking at the current cudf PR. The STRINGS_TEST fail to link due to the fact that it thinks cudf::make_strings_column(cudf::device_span<thrust::pair ...) should exist but the signature is actually cudf::make_strings_column(cudf::device_span<rapids::thrust::pair ...).

One of the reasons I am proposing a common prefix is that interop between libraries ( cudf / raft / cuml / cugraph ) is possible when they are used within the same TU. If each project goes with a unique thrust namespace, no interop that includes thrust types will be possible,

robertmaynard avatar Aug 03 '22 18:08 robertmaynard

Excellent, the common namespace prefix seems to be essential here. However, it makes me wonder if this truly solves any problems (compared with no namespace) because some rmm types (for example) will not be compatible with non-RAPIDS libraries. It’s a slight improvement but also partially shifts the problem to require that non-RAPIDS libraries match the Thrust namespace prefix and version OR avoid using such Thrust-derived types from RAPIDS libraries. I suppose we want to remove Thrust APIs from our public interfaces so perhaps this is solved in that way.

bdice avatar Aug 03 '22 18:08 bdice

However, it makes me wonder if this truly solves any problems (compared with no namespace) because some rmm types (for example) will not be compatible with non-RAPIDS libraries.

It does solve a real problem which is that end user programs bring RAPIDS projects together with other C++ libraries that use thrust. Since these projects are built ignorant of each other they use different thrust versions and cause weird runtime failures.

To be clear the proposal for rmm won't impact the ability for non RAPIDS projects to use it. The rmm headers will adapt based on the value of THRUST_WRAPPED_NAMESPACE. What they lose is the ability to use something like raft + rmm + some thirdparty library ( non-header only ) that uses thrust in the same file ( TU ).

robertmaynard avatar Aug 03 '22 19:08 robertmaynard

it thinks cudf::make_strings_column(cudf::device_span<thrust::pair ...) should exist but the signature is actually cudf::make_strings_column(cudf::device_span<rapids::thrust::pair ...).

The THRUST_NS_QUALIFIER macro may simplify things here:

cudf::make_strings_column(cudf::device_span<THRUST_NS_QUALIFIER::pair ...)

It expands to ::thrust by default, but rapids::thrust when THRUST_CUB_WRAPPED_NAMESPACE=rapids.

I suppose we want to remove Thrust APIs from our public interfaces so perhaps this is solved in that way.

I strongly support this. The namespace macros are only robust when Thrust isn't exposed publicly.

alliepiper avatar Aug 03 '22 19:08 alliepiper

(Thanks for the clarification @robertmaynard / @allisonvacanti! My previous message may have sounded negative on this idea - poor writing on my part.)

bdice avatar Aug 03 '22 19:08 bdice

fwiw, the original impetus for this issue was due to mixing CUB versions in a C++ code using both libcudf and CUB headers.

We definitely don't have any CUB types as part of any public interface, so if it's easier to just custom scope CUB and not Thrust in the short term then I think that would be valuable to do.

jrhemstad avatar Aug 03 '22 20:08 jrhemstad

if it's easier to just custom scope CUB and not Thrust in the short term then I think that would be valuable to do.

This can be done in the current macro system by just changing CUB_WRAPPED_NAMESPACE instead of THRUST_CUB_WRAPPED_NAMESPACE. If the conflicts are just in CUB and CUB is not exposed in any cudf headers, this should work as a temporary fix.

I'm not sure how realistic that "CUB is not exposed" bit is, though. Thrust's headers indirectly pull in CUB headers often, so there may still be issues.

alliepiper avatar Aug 04 '22 16:08 alliepiper

I am working on a pr draft of cudf using cub in a namespace.

robertmaynard avatar Aug 04 '22 19:08 robertmaynard

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Sep 03 '22 20:09 github-actions[bot]

With a solution to https://github.com/NVIDIA/cub/issues/545 being merged, I think the best option for cudf is to patch our version of thrust.

robertmaynard avatar Sep 08 '22 15:09 robertmaynard

With a solution to NVIDIA/cub#545 being merged, I think the best option for cudf is to patch our version of thrust.

Can't we just use a version of Thrust that makes the same changes? https://github.com/NVIDIA/thrust/issues/1788

jrhemstad avatar Sep 09 '22 20:09 jrhemstad

Can't we just use a version of Thrust that makes the same changes? NVIDIA/thrust#1788

No released version of Thrust currently exists have the required changes. Once Thrust 2.1 has been released ( or the patches are back ported ) we can stop patching our Thrust.

robertmaynard avatar Sep 09 '22 20:09 robertmaynard

No released version of Thrust currently exists have the required changes. Once Thrust 2.1 has been released ( or the patches are back ported ) we can stop patching our Thrust.

If we're going to go through the work of making the changes locally for a patch, why not just contribute the changes to Thrust and then update our version of Thrust?

jrhemstad avatar Sep 09 '22 21:09 jrhemstad

Sounds reasonable we can use the 1.17.X branch once the following PR's are merged

  • https://github.com/NVIDIA/cub/pull/572
  • https://github.com/NVIDIA/thrust/pull/1792

robertmaynard avatar Sep 09 '22 21:09 robertmaynard

@robertmaynard, @jrhemstad, mentioned PRs are merged.

gevtushenko avatar Sep 12 '22 14:09 gevtushenko

@senior-zero Will a tag be made for these changes? Our CPM configuration in cuDF relies on a tag right now. I'd use branch 1.17.X but I would prefer for the version to be "fixed" like a tag or commit hash, and I don't think commit hashes are enough for us to shallow clone the repo (which saves network time and disk space).

bdice avatar Sep 12 '22 19:09 bdice

@bdice we'll create a tag soon.

gevtushenko avatar Sep 13 '22 12:09 gevtushenko

@bdice @allisonvacanti has just created 1.17.2 tag.

gevtushenko avatar Sep 13 '22 17:09 gevtushenko