Build fails with nvshmem installed as RPM in /usr/lib64 — unable to create wheel

Open vshawrh opened this issue 4 months ago • 1 comments

Description

When trying to build this package against nvshmem provided as an RPM (installed under /usr/lib64 and /usr/include/nvshmem_${CUDA_MAJOR_VERSION}), the wheel build fails due to incorrect linking flags and library discovery issues.

Currently, setup.py assumes a prefix-style layout with ${nvshmem_dir}/lib and ${nvshmem_dir}/include. This breaks in RPM-based environments, where nvshmem is installed system-wide.

Issues observed

  1. Link flags not generic

    • The existing extra_link_args use explicit -l:libnvshmem_host.so and -l:libnvshmem_device.a.
    • These are brittle because -l:<file> matches one exact filename, whereas -lnvshmem_host and -lnvshmem_device let the linker search the library path for lib<name>.so or lib<name>.a.
    • nvshmem_bootstrap_uid.so is also linked, although it is not actually required.
  2. RPATH handling

    • Previously only ${nvshmem_dir}/lib was added to -rpath.
    • On RPM-based installs, the libraries are under /usr/lib64/nvshmem/${CUDA_MAJOR_VERSION} and /usr/lib64, so those directories have to be added to -rpath explicitly.
  3. Device linking

    • nvcc_dlink was missing system paths for nvshmem_device, causing unresolved references during device code linking.
  4. Wheel creation fails

    • Since the build cannot resolve nvshmem_host and nvshmem_device properly, pip wheel . fails to produce a wheel on systems where nvshmem is installed as an RPM.
    • Error messages include missing libnvshmem_host.so.3 and unresolved device symbols.
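
One way to confirm the library-discovery part of the failure before touching setup.py is to list what is actually present in the candidate directories. The following is a minimal sketch: the helper name find_nvshmem_libs is hypothetical, and the directories to pass in are the ones reported above.

```python
import glob
import os


def find_nvshmem_libs(search_dirs, names=("nvshmem_host", "nvshmem_device")):
    """Map each library name to the files matching lib<name>.* in search_dirs.

    Hypothetical diagnostic helper; it only inspects the filesystem and does
    not invoke the linker.
    """
    found = {}
    for name in names:
        matches = []
        for directory in search_dirs:
            pattern = os.path.join(directory, f"lib{name}.*")
            matches.extend(sorted(glob.glob(pattern)))
        found[name] = matches
    return found
```

Calling it with, for example, ['/usr/lib64', '/usr/lib64/nvshmem/12'] (12 is only an example CUDA major version) shows whether an unversioned libnvshmem_host.so sits next to libnvshmem_host.so.3. That matters because -lnvshmem_host only matches libnvshmem_host.so or libnvshmem_host.a, not a versioned file on its own.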

Changes needed

  • Use system include and library directories:

    include_dirs.extend(['/usr/include', f'/usr/include/nvshmem_{os.getenv("CUDA_MAJOR_VERSION")}'])
    library_dirs.extend(['/usr/lib64', f'/usr/lib64/nvshmem/{os.getenv("CUDA_MAJOR_VERSION")}'])
    
  • Update linker flags to use library names (-l<name>) instead of exact filenames (-l:<file>):

    extra_link_args.extend([
        '-lnvshmem',
        '-Wl,--no-as-needed',
        '-lnvshmem_host',
        '-lnvshmem_device',
        f'-Wl,-rpath,/usr/lib64/nvshmem/{os.getenv("CUDA_MAJOR_VERSION")}:/usr/lib64'
    ])
    
  • Update device linking with nvcc:

    nvcc_dlink.extend([
        '-dlink',
        '-L/usr/lib64',
        f'-L/usr/lib64/nvshmem/{os.getenv("CUDA_MAJOR_VERSION")}',
        '-lnvshmem_device'
    ])
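
The snippets above call os.getenv("CUDA_MAJOR_VERSION") inline, so an unset variable would silently produce paths such as /usr/include/nvshmem_None. A minimal sketch of computing the same paths once, with an explicit check, could look like this. The helper name rpm_nvshmem_paths is hypothetical, and the directory layout is simply the one described in this issue.

```python
import os


def rpm_nvshmem_paths(cuda_major=None):
    """Return (include_dirs, library_dirs, rpath_flag) for an RPM-style install.

    Hypothetical helper; the layout follows the paths reported in this issue.
    """
    cuda_major = cuda_major or os.getenv("CUDA_MAJOR_VERSION")
    if not cuda_major:
        # Fail early instead of formatting "None" into a path.
        raise RuntimeError("CUDA_MAJOR_VERSION must be set (for example, 12)")
    include_dirs = ["/usr/include", f"/usr/include/nvshmem_{cuda_major}"]
    library_dirs = [f"/usr/lib64/nvshmem/{cuda_major}", "/usr/lib64"]
    # Same rpath order as in the proposed extra_link_args.
    rpath_flag = "-Wl,-rpath," + ":".join(library_dirs)
    return include_dirs, library_dirs, rpath_flag
```

The returned lists can then feed include_dirs, library_dirs, and the -L arguments for nvcc_dlink, so the version lookup happens in one place.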
    

Expected outcome

  • Build succeeds when nvshmem is installed from RPM.
  • Wheel (.whl) can be created and installed in a clean environment.
  • The linker finds nvshmem_host and nvshmem_device through the library search path, and libnvshmem_host.so.3 resolves at run time via the rpath, without hardcoded filenames.

vshawrh, Aug 28 '25 20:08

To make the build process more portable and user-configurable, I propose that we update setup.py to use environment variables for discovering NVSHMEM paths and linker flags. We could adopt an approach that prioritizes explicit environment variables, with a fallback mechanism for backward compatibility.

New Environment Variables:

  • NVSHMEM_INCLUDE_PATH: An os.pathsep-separated list of include directories for the NVSHMEM headers.
  • NVSHMEM_LIBRARY_PATH: An os.pathsep-separated list of library directories for the NVSHMEM libraries.
  • NVSHMEM_LDFLAGS: A string containing all necessary linker flags (e.g., -lnvshmem -Wl,--no-as-needed -lnvshmem_host -lnvshmem_device -Wl,-rpath,...).
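
As a rough illustration of how these three variables might be parsed, the sketch below splits the path lists on os.pathsep and the flag string with shlex. The helper names are hypothetical, and using shlex for NVSHMEM_LDFLAGS is an assumption on my part, not something decided here.

```python
import os
import shlex


def split_path_list(value):
    """Split an os.pathsep-separated string into a list, dropping empty entries."""
    return [p for p in (value or "").split(os.pathsep) if p]


def split_ldflags(value):
    """Split a flag string with POSIX shell rules, so quoted arguments survive."""
    return shlex.split(value or "")


# How setup.py might read the proposed variables (all three may be unset).
include_dirs = split_path_list(os.getenv("NVSHMEM_INCLUDE_PATH"))
library_dirs = split_path_list(os.getenv("NVSHMEM_LIBRARY_PATH"))
extra_link_args = split_ldflags(os.getenv("NVSHMEM_LDFLAGS"))
```

Dropping empty entries means a trailing separator in the variable does not add an empty directory to the compiler command line.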

Implement Fallback Logic:

  • To maintain compatibility with existing setups, if NVSHMEM_INCLUDE_PATH is not defined but NVSHMEM_DIR is, set NVSHMEM_INCLUDE_PATH = f"{NVSHMEM_DIR}/include".
  • Similarly, if NVSHMEM_LIBRARY_PATH is not defined but NVSHMEM_DIR is, set NVSHMEM_LIBRARY_PATH = f"{NVSHMEM_DIR}/lib".

Conditional Compilation:

The NVSHMEM extension should only be built if both NVSHMEM_INCLUDE_PATH and NVSHMEM_LIBRARY_PATH are successfully resolved (either directly or via the NVSHMEM_DIR fallback). This prevents build failures when NVSHMEM is not available or configured.
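
The fallback and the conditional build could be combined in a single resolution step. This is only a sketch of the idea: the function name resolve_nvshmem is hypothetical, and it takes the environment as a mapping so the logic can be exercised without touching os.environ.

```python
import os


def resolve_nvshmem(env=None):
    """Return (include_dirs, library_dirs), or None when NVSHMEM cannot be located."""
    env = os.environ if env is None else env
    nvshmem_dir = env.get("NVSHMEM_DIR")
    include = env.get("NVSHMEM_INCLUDE_PATH")
    library = env.get("NVSHMEM_LIBRARY_PATH")
    # Explicit variables win; NVSHMEM_DIR only fills in whichever one is missing.
    if not include and nvshmem_dir:
        include = os.path.join(nvshmem_dir, "include")
    if not library and nvshmem_dir:
        library = os.path.join(nvshmem_dir, "lib")
    if not include or not library:
        return None  # the caller skips the NVSHMEM extension
    return include.split(os.pathsep), library.split(os.pathsep)
```

In setup.py, a None result would mean building without the NVSHMEM extension, while any other result supplies the include and library directories for it.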

Note

I've done similar work in flashinfer-python: https://github.com/flashinfer-ai/flashinfer/commit/ce68e1d0cc8a69da4ead85a5280a183f3e2a5a00

EmilienM, Aug 29 '25 16:08