Build fails with nvshmem installed as RPM in /usr/lib64 — unable to create wheel
Description
When trying to build this package against nvshmem provided as an RPM (installed under /usr/lib64 and /usr/include/nvshmem_${CUDA_MAJOR_VERSION}), the wheel build fails due to incorrect linking flags and library discovery issues.
Currently, the setup.py assumes local paths like ${nvshmem_dir}/lib and ${nvshmem_dir}/include. This breaks in RPM-based environments where nvshmem is installed system-wide.
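For illustration, the current assumption has roughly this shape (a sketch only, not the actual setup.py; the `/opt/nvshmem` default is hypothetical). On an RPM-based system there is no single such prefix, so this discovery scheme cannot find anything:

```python
import os

# Sketch of single-prefix discovery: everything is derived from one nvshmem_dir.
# On RPM installs, headers live in /usr/include/nvshmem_<CUDA_MAJOR_VERSION> and
# libraries in /usr/lib64 and /usr/lib64/nvshmem/<CUDA_MAJOR_VERSION>, so neither
# of the derived paths exists.
nvshmem_dir = os.getenv("NVSHMEM_DIR", "/opt/nvshmem")  # however nvshmem_dir is obtained; default is hypothetical
include_dirs = [os.path.join(nvshmem_dir, "include")]
library_dirs = [os.path.join(nvshmem_dir, "lib")]
```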
Issues observed
- Link flags not generic
  - Existing `extra_link_args` used explicit `-l:libnvshmem_host.so` and `-l:libnvshmem_device.a`.
  - These are brittle because they rely on filenames instead of sonames (`-lnvshmem_host`, `-lnvshmem_device`).
  - Also, `nvshmem_bootstrap_uid.so` was incorrectly linked, but it is not actually required.
- RPATH handling
  - Previously only `${nvshmem_dir}/lib` was added to `-rpath`.
  - On RPM-based installs, libraries are under `/usr/lib64/nvshmem/${CUDA_MAJOR_VERSION}` and `/usr/lib64`, requiring explicit rpath entries.
- Device linking
  - `nvcc_dlink` was missing system paths for `nvshmem_device`, causing unresolved references during device code linking.
- Wheel creation fails
  - Since the build cannot resolve `nvshmem_host` and `nvshmem_device` properly, `pip wheel .` fails to produce a wheel on systems where nvshmem is installed as an RPM.
  - Error messages include a missing `libnvshmem_host.so.3` and unresolved device symbols.
Changes needed
- Use system include and library directories:

  ```python
  include_dirs.extend([
      '/usr/include',
      f'/usr/include/nvshmem_{os.getenv("CUDA_MAJOR_VERSION")}',
  ])
  library_dirs.extend([
      '/usr/lib64',
      f'/usr/lib64/nvshmem/{os.getenv("CUDA_MAJOR_VERSION")}',
  ])
  ```

- Update linker flags to use sonames instead of filenames:

  ```python
  extra_link_args.extend([
      '-lnvshmem',
      '-Wl,--no-as-needed',
      '-lnvshmem_host',
      '-lnvshmem_device',
      f'-Wl,-rpath,/usr/lib64/nvshmem/{os.getenv("CUDA_MAJOR_VERSION")}:/usr/lib64',
  ])
  ```

- Update device linking with `nvcc`:

  ```python
  nvcc_dlink.extend([
      '-dlink',
      '-L/usr/lib64',
      f'-L/usr/lib64/nvshmem/{os.getenv("CUDA_MAJOR_VERSION")}',
      '-lnvshmem_device',
  ])
  ```
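One thing to note: all three snippets assume `CUDA_MAJOR_VERSION` is set in the build environment. A small guard along these lines (an illustrative sketch, not part of the current code) would surface a missing value up front instead of producing paths containing the literal string `None`:

```python
import os

cuda_major = os.getenv("CUDA_MAJOR_VERSION")
if cuda_major is None:
    # Without this check the f-strings above would quietly expand to paths like
    # /usr/include/nvshmem_None and the failure would only show up at link time.
    raise RuntimeError(
        "CUDA_MAJOR_VERSION must be set when building against the nvshmem RPM layout"
    )
```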
Expected outcome
- Build succeeds when `nvshmem` is installed from RPM.
- Wheel (`.whl`) can be created and installed in a clean environment.
- Linker resolves `libnvshmem_host.so.3` and `libnvshmem_device` dynamically without hardcoding filenames.
To make the build process more portable and user-configurable, I propose that we update setup.py to use environment variables for discovering NVSHMEM paths and linker flags.
We could adopt an approach that prioritizes explicit environment variables, with a fallback mechanism for backward compatibility.
New Environment Variables:
- `NVSHMEM_INCLUDE_PATH`: An `os.pathsep`-separated list of include directories for the NVSHMEM headers.
- `NVSHMEM_LIBRARY_PATH`: An `os.pathsep`-separated list of library directories for the NVSHMEM libraries.
- `NVSHMEM_LDFLAGS`: A string containing all necessary linker flags (e.g., `-lnvshmem -Wl,--no-as-needed -lnvshmem_host -lnvshmem_device -Wl,-rpath,...`).
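A sketch of how setup.py might read these variables (the `_split_paths` helper is hypothetical, and parsing `NVSHMEM_LDFLAGS` with `shlex.split` is an assumption so that quoted flags are handled correctly):

```python
import os
import shlex

def _split_paths(value):
    # Split an os.pathsep-separated list of directories, dropping empty entries.
    return [p for p in (value or "").split(os.pathsep) if p]

nvshmem_include_dirs = _split_paths(os.getenv("NVSHMEM_INCLUDE_PATH"))
nvshmem_library_dirs = _split_paths(os.getenv("NVSHMEM_LIBRARY_PATH"))
nvshmem_ldflags = shlex.split(os.getenv("NVSHMEM_LDFLAGS", ""))
```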
Implement Fallback Logic:
- To maintain compatibility with existing setups, if `NVSHMEM_INCLUDE_PATH` is not defined but `NVSHMEM_DIR` is, set `NVSHMEM_INCLUDE_PATH = f"{NVSHMEM_DIR}/include"`.
- Similarly, if `NVSHMEM_LIBRARY_PATH` is not defined but `NVSHMEM_DIR` is, set `NVSHMEM_LIBRARY_PATH = f"{NVSHMEM_DIR}/lib"`.
Conditional Compilation:
The NVSHMEM extension should only be built if both NVSHMEM_INCLUDE_PATH and NVSHMEM_LIBRARY_PATH are successfully resolved (either directly or via the NVSHMEM_DIR fallback). This prevents build failures when NVSHMEM is not available or configured.
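Putting the fallback and the conditional build together, it could look roughly like this (a sketch under the assumptions above; the extension name and source path are hypothetical placeholders, not this package's actual layout):

```python
import os
import shlex
from setuptools import Extension

ext_modules = []

# Fallback: derive the new variables from NVSHMEM_DIR when they are unset,
# so existing NVSHMEM_DIR-based setups keep working unchanged.
nvshmem_dir = os.getenv("NVSHMEM_DIR")
include_path = os.getenv("NVSHMEM_INCLUDE_PATH") or (
    f"{nvshmem_dir}/include" if nvshmem_dir else None
)
library_path = os.getenv("NVSHMEM_LIBRARY_PATH") or (
    f"{nvshmem_dir}/lib" if nvshmem_dir else None
)
ldflags = os.getenv("NVSHMEM_LDFLAGS", "")

# Conditional compilation: only declare the NVSHMEM extension when both path
# variables resolved (directly or via the fallback); otherwise skip it so the
# rest of the package still builds without NVSHMEM.
if include_path and library_path:
    ext_modules.append(
        Extension(
            "nvshmem_ext",                     # hypothetical extension name
            sources=["csrc/nvshmem_ext.cpp"],  # hypothetical source file
            include_dirs=include_path.split(os.pathsep),
            library_dirs=library_path.split(os.pathsep),
            extra_link_args=shlex.split(ldflags),
        )
    )
```

With this in place, setting only `NVSHMEM_DIR` behaves as it does today, while RPM-based environments can point `NVSHMEM_INCLUDE_PATH`/`NVSHMEM_LIBRARY_PATH` at the system locations and pass the soname-based flags through `NVSHMEM_LDFLAGS`.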
Note
I've done similar work in flashinfer-python: https://github.com/flashinfer-ai/flashinfer/commit/ce68e1d0cc8a69da4ead85a5280a183f3e2a5a00