
[FEA] reduce libcuvs binary size


Is your feature request related to a problem? Please describe.

As of this writing, the latest 25.04 libcuvs wheels have the following sizes:

| distribution | arch    | size (compressed) | size (uncompressed) |
|--------------|---------|-------------------|---------------------|
| libcuvs-cu11 | x86_64  | 948 MiB           | 1377 MiB            |
| libcuvs-cu11 | aarch64 | 999 MiB           | 1378 MiB            |
| libcuvs-cu12 | x86_64  | 1129 MiB          | 1626 MiB            |
| libcuvs-cu12 | aarch64 | 1128 MiB          | 1628 MiB            |

NOTE: v25.4.0a84, latest 25.04 nightly as of Feb 27, 2025

How I got those sizes:

Downloaded wheels from https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/libcuvs-cu12 and https://pypi.anaconda.org/rapidsai-wheels-nightly/simple/libcuvs-cu11.

```shell
pydistcheck \
  --inspect \
  --output-file-size-unit Mi \
  --output-file-size-precision 1 \
  --select 'distro-too-large-compressed' \
  ./libcuvs-wheels/*.whl
```

Such large packages require a relatively large amount of network bandwidth to download and disk space to store.

This issue tracks the work of reducing those sizes.

Describe the solution you'd like

Those package sizes should be reduced. It's hard to say how small is small enough, but the goal of getting these wheels onto pypi.org provides one set of targets.

PyPI requirements for individual files:

  • over 100 MiB = requires an exception (see https://github.com/pypi/support/issues)
  • over 1 GiB = absolutely not allowed

There are also limits on the total size of a project, summed over all releases, so smaller packages mean more releases can be hosted on PyPI.

Describe alternatives you've considered

Some ideas for how to address this:

  • compiling fewer combinations of templates (see the first sketch after this list)
    • https://github.com/rapidsai/cuvs/issues/110
    • similar change in cugraph: https://github.com/rapidsai/cugraph/pull/4720
  • tuning compiler flags to optimize for binary size
  • checking the wheel contents and removing any extraneous files found
  • replacing static linking / vendoring with dynamic linking where possible
    • e.g., relying on nvidia-nccl-cu{11,12} wheels instead of vendoring a copy of libnccl.so
  • code changes to avoid recompiling the same kernels multiple times (see the second sketch after this list)
    • related conversation: https://github.com/rapidsai/cuvs/issues/634
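
The first sketch below illustrates the "fewer template combinations" idea in isolation: a hypothetical kernel template is explicitly instantiated only for a canonical 64-bit index type, and callers using other index types convert at the call boundary instead of pulling in a second full instantiation. The names (`search_impl`, `search_int32`) are made up for illustration and are not the cuVS API; this is just one shape the technique can take.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical kernel template: without intervention this gets compiled
// once per index type that callers use (int32_t, int64_t, uint32_t, ...).
template <typename IdxT>
__global__ void search_impl(const float* queries, IdxT* neighbors, int k)
{
  // ... kernel body elided ...
}

// Compile only the canonical 64-bit instantiation into the library.
template __global__ void search_impl<int64_t>(const float*, int64_t*, int);

// 32-bit entry point: convert at the call boundary instead of compiling
// a second full kernel instantiation.
void search_int32(const float* queries, int32_t* neighbors, int n_queries, int k)
{
  int64_t* tmp = nullptr;
  cudaMalloc(&tmp, sizeof(int64_t) * static_cast<size_t>(n_queries) * k);
  search_impl<int64_t><<<n_queries, 256>>>(queries, tmp, k);
  // ... narrow tmp back into neighbors (error handling elided) ...
  cudaFree(tmp);
}
```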

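The second sketch illustrates the "avoid recompiling the same kernels" idea with the standard `extern template` pattern: the header declares the instantiations, so every translation unit that includes it reuses one compiled copy instead of emitting its own. Again, the names and file layout are hypothetical, not taken from the cuVS codebase.

```cpp
// pairwise_distance.hpp -- included by many translation units.
#pragma once

template <typename T>
void pairwise_distance(const T* a, const T* b, T* out, int n);

// Suppress implicit instantiation in every includer; exactly one
// source file (below) provides the compiled definitions.
extern template void pairwise_distance<float>(const float*, const float*, float*, int);
extern template void pairwise_distance<double>(const double*, const double*, double*, int);
```

```cpp
// pairwise_distance.cpp (or .cu) -- the only place the template is compiled.
#include "pairwise_distance.hpp"

template <typename T>
void pairwise_distance(const T* a, const T* b, T* out, int n)
{
  for (int i = 0; i < n; ++i) { out[i] = a[i] - b[i]; }  // placeholder body
}

// Explicit instantiations: one object file carries this code.
template void pairwise_distance<float>(const float*, const float*, float*, int);
template void pairwise_distance<double>(const double*, const double*, double*, int);
```
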
Additional context

Attaching a report that @robertmaynard put together, showing the approximate contribution of each CUDA kernel to the total binary size of libcuvs.

size.log

jameslamb · Feb 27, 2025