
Avoid vendoring NVTX


Description

Addresses part of #1833.

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

bdice · May 28 '25 13:05

I believe the failures on CUDA 11 are indicative of a problem in include paths. The failures look like this:

    sccache $BUILD_PREFIX/bin/nvcc -forward-unknown-to-host-compiler -DCUB_DISABLE_NAMESPACE_MAGIC -DCUB_IGNORE_NAMESPACE_MAGIC_ERROR -DLIBCUDACXX_ENABLE_EXPERIMENTAL_MEMORY_RESOURCE -DRMM_LOG_ACTIVE_LEVEL=RAPIDS_LOGGER_LOG_LEVEL_INFO -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_DISABLE_ABI_NAMESPACE -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DTHRUST_IGNORE_ABI_NAMESPACE_ERROR -I$SRC_DIR/cpp -I$SRC_DIR/cpp/include -I$SRC_DIR/cpp/build/include -isystem $SRC_DIR/cpp/build/_deps/gtest-src/googlemock/include -isystem $SRC_DIR/cpp/build/_deps/gtest-src/googlemock -isystem $SRC_DIR/cpp/build/_deps/gtest-src/googletest/include -isystem $SRC_DIR/cpp/build/_deps/gtest-src/googletest -isystem /usr/local/cuda/targets/x86_64-linux/include -isystem $PREFIX/include -isystem $BUILD_PREFIX/include -O3 -DNDEBUG -std=c++17 "--generate-code=arch=compute_70,code=[sm_70]" "--generate-code=arch=compute_75,code=[sm_75]" "--generate-code=arch=compute_80,code=[sm_80]" "--generate-code=arch=compute_86,code=[sm_86]" "--generate-code=arch=compute_90,code=[compute_90,sm_90]" -Xcompiler=-fPIE "-Xcompiler=-ffile-prefix-map=$PREFIX/=''" -MD -MT tests/CMakeFiles/THRUST_ALLOCATOR_TEST.dir/mr/device/thrust_allocator_tests.cu.o -MF tests/CMakeFiles/THRUST_ALLOCATOR_TEST.dir/mr/device/thrust_allocator_tests.cu.o.d -x cu -c $SRC_DIR/cpp/tests/mr/device/thrust_allocator_tests.cu -o tests/CMakeFiles/THRUST_ALLOCATOR_TEST.dir/mr/device/thrust_allocator_tests.cu.o
    In file included from $PREFIX/include/cuda/stream_ref:53,
                     from $SRC_DIR/cpp/include/rmm/cuda_stream_view.hpp:21,
                     from $SRC_DIR/cpp/include/rmm/mr/device/system_memory_resource.hpp:19,
                     from $SRC_DIR/cpp/tests/mr/device/test_utils.hpp:20,
                     from $SRC_DIR/cpp/tests/mr/device/mr_ref_test.hpp:20,
                     from $SRC_DIR/cpp/tests/mr/device/thrust_allocator_tests.cu:17:
    $PREFIX/include/cuda/std/__cuda/api_wrapper.h:24:24: error: missing binary operator before token "("
       24 | #if _CCCL_CUDA_COMPILER(CLANG)
          |                        ^

But the failure is caused by the build finding the wrong CCCL headers, probably due to the ordering here:

-isystem /usr/local/cuda/targets/x86_64-linux/include -isystem $PREFIX/include

I'll need to investigate how to make the conda prefix include/ go before the conda CUDA packages' include/ if we want this to discover the CCCL conda package that we installed.
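
For context on that diagnostic: "missing binary operator before token "("" is the usual preprocessor symptom of a function-like macro being used in an #if without a matching function-like definition, which is what happens when wrapper headers from one CCCL version are combined with config headers from another. A minimal, hypothetical reproduction of the failure mode (illustrative only, not rmm or CCCL code):

    // repro.cpp -- IS_COMPILER and CLANG are stand-ins for CCCL's config
    // macros. If the header that defines the function-like macro is never
    // found (e.g. an older copy shadows it on the include path), both
    // identifiers are undefined; in an #if they expand to 0, leaving the
    // ill-formed expression `0 (0)`, which produces the same diagnostic.
    #if IS_COMPILER(CLANG)  // error: missing binary operator before token "("
    int never_compiled;
    #endif

    int main() { return 0; }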

bdice · May 28 '25 21:05

I'm fine doing this if we can get it working now. I forgot the plan from https://github.com/NVIDIA/cccl/issues/1939#issuecomment-2830993942 when I opened https://github.com/rapidsai/build-planning/issues/183, where I proposed waiting to devendor CCCL until we moved to 3.0. Up to you, but I anticipated things being easier if we waited until we dropped CUDA 11 (which I think you are observing now with the build failures). We could reduce the scope of this PR to just NVTX if you prefer.

vyasr · May 29 '25 01:05

@vyasr Yes, I think scoping to just NVTX is a good start.

The problem is that the failure on CUDA 11 is really just a symptom of the wrong include-directory order. The same problem applies to CUDA 12; it just happens that the CCCL bundled with CUDA 12 is closer to the CCCL conda package that we actually want, so the build doesn't fail. We are still getting a mix of CCCL versions.

I would like to try to solve this before we migrate to CCCL 3.0, just to iron out any other problems with the devendoring. It's okay if we update to 3.0 first; either way is fine.

bdice · May 29 '25 01:05

OK, makes sense. Now that we've started dropping CUDA 11, we'll need to make sure we remember to debug this issue, since we won't observe it in CI anymore. In the meantime, scoping this PR to focus only on NVTX seems sensible. Maybe open an issue for the CCCL include-order bit so that we don't forget?

vyasr · May 30 '25 17:05

As I think more about this, I see some tension. We would prefer a hard pin on nvtx-c so that we match exactly what is in rapids-cmake's versions.json. However, the way librmm is used today, it is a dependency of all of RAPIDS; if librmm pins nvtx-c, it becomes impossible for anyone to use a different nvtx-c pinning in their environment. We have a few ways we could solve this; more ideas would be helpful.

Some options:

  • Loosen the pinning and allow it to float forward (nvtx-c >=3.2.0)
    • I think rapids-cmake will accept newer versions if found at build time.
    • This might be okay for NVTX but might not work well for CCCL. A consistent solution would be ideal.
  • Separate librmm into a -dev package and a runtime package (https://github.com/rapidsai/build-planning/issues/46)

bdice · Jun 25 '25 14:06

  • This might be okay for NVTX but might not work well for CCCL. A consistent solution would be ideal.

Might not work well for CCCL, or won't work well for CCCL? This seems more straightforward than splitting the package (but if we'll have to split it in the near future anyway, we might as well bite the bullet).

gforsyth · Jun 25 '25 14:06

An unpinned CCCL would make building and testing too difficult; we need CCCL pinned at build time. But it's hard to have pinned dependencies for packages that are mostly header-only. Basically, you either split out a "-dev" package or you tell the consumer to supply their own NVTX and CCCL. That is also an okay outcome for me. Maybe we should start there: pin at build time and add an unpinned run dependency. If consumers struggle with downstream builds, they can add their own constraints.

bdice · Jun 25 '25 15:06

Maybe we should start there: pin at build time and add an unpinned run dependency. If consumers struggle with downstream builds, they can add their own constraints.

That seems like a reasonable division of labor. Installing packages that use rmm doesn't unnecessarily constrain the environment, while slightly more care is required to use it as a build dependency.

gforsyth · Jun 25 '25 17:06

Is it possible/realistic to move all uses of NVTX out of rmm headers and into implementation files (as part of the move to rmm as a shared library)? Historically, the reason we have needed to export a dependency on NVTX is that rmm is header-only, so the NVTX usage in headers is effectively "public". But since all we're using it for is ranges, in theory we should be able to move that logic into source files and isolate the dependency from consumers altogether.
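
A minimal sketch of that idea, using hypothetical names rather than rmm's actual API: the shipped header declares an opaque RAII range whose constructor and destructor are compiled into the library, so only one translation unit ever includes NVTX headers.

    // detail/nvtx_range.hpp -- shipped header; note: no NVTX include.
    namespace rmm::detail {
    class nvtx_range {
     public:
      explicit nvtx_range(char const* name);  // defined in nvtx_range.cpp
      ~nvtx_range();
      nvtx_range(nvtx_range const&)            = delete;
      nvtx_range& operator=(nvtx_range const&) = delete;
    };
    }  // namespace rmm::detail

    // nvtx_range.cpp -- compiled into librmm.so; the only file that sees NVTX.
    #include "detail/nvtx_range.hpp"
    #include <nvtx3/nvToolsExt.h>
    rmm::detail::nvtx_range::nvtx_range(char const* name) { nvtxRangePushA(name); }
    rmm::detail::nvtx_range::~nvtx_range() { nvtxRangePop(); }

With this split, header consumers need no NVTX dependency at build time; they only link against librmm, which carries the NVTX usage internally.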

That doesn't solve the problem for CCCL since we actually pass CCCL types across ABI boundaries. That's a problem for another day.

vyasr · Jun 25 '25 21:06

I think that's worth exploring. Sizable parts of rmm are headers because they're templated, so I'd want to look first at how device_uvector and the other templated types use NVTX.
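
To make the template concern concrete: a header-only template can still route through a non-template helper like the hypothetical nvtx_range sketched above, because the template only needs the helper's declaration, not NVTX itself. A simplified, hypothetical sketch:

    #include <cstddef>
    #include "detail/nvtx_range.hpp"  // the opaque wrapper from the earlier sketch

    namespace rmm {
    // Stays a header template, but the NVTX calls themselves live in librmm.so.
    template <typename T>
    class device_uvector {
     public:
      void resize(std::size_t new_size)
      {
        detail::nvtx_range range{"device_uvector::resize"};  // ctor/dtor are non-inline
        // ... device reallocation and copy would go here ...
      }
    };
    }  // namespace rmm

The open question is whether every templated use of NVTX in rmm can be funneled through such non-template helpers without losing information (e.g., type names in range labels).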

bdice · Jun 25 '25 22:06

Agreed, the templating makes this challenging. I don't know if it is feasible to completely avoid exporting some NVTX dependency to consumers, but I would like us to explore how far we can go with that.

vyasr · Jun 30 '25 17:06