faiss-wheels
Reduce wheel package size for faiss-gpu CUDA 11.0 build
The CUDA 11.0 build in #56 bloats the wheel package size from 85.5 MB to 216.5 MB. We need to investigate file-size reduction.
Relevant: https://github.com/pytorch/pytorch/issues/56055
One approach seems to be dropping the architecture-specific binaries from the CUDA static libraries via nvprune, like this:
```bash
nvprune \
    -gencode arch=compute_60,code=sm_60 \
    -gencode arch=compute_70,code=sm_70 \
    -gencode arch=compute_75,code=sm_75 \
    -gencode arch=compute_80,code=sm_80 \
    -gencode arch=compute_80,code=compute_80 \
    -o /usr/local/cuda/lib64/libcublas_static_slim.a \
    /usr/local/cuda/lib64/libcublas_static.a
```
Currently there are four CUDA static-library dependencies, and applying nvprune to each of them (see the loop sketch after the list) only slightly reduces the binary size:
- libcublas_static.a
- libcublasLt_static.a
- libcudart_static.a
- libculibos.a
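To prune all four in one pass, a loop along these lines should work (a sketch, reusing the `-gencode` set from the command above; whether nvprune has any effect on host-only archives like libculibos.a is untested):

```bash
# Target architectures to keep, mirroring the nvprune command above.
GENCODE=(
    -gencode arch=compute_60,code=sm_60
    -gencode arch=compute_70,code=sm_70
    -gencode arch=compute_75,code=sm_75
    -gencode arch=compute_80,code=sm_80
    -gencode arch=compute_80,code=compute_80
)

# Prune each CUDA static library we link against, writing *_slim.a copies.
# Paths assume the default CUDA install location.
for lib in libcublas_static libcublasLt_static libcudart_static libculibos; do
    nvprune "${GENCODE[@]}" \
        -o "/usr/local/cuda/lib64/${lib}_slim.a" \
        "/usr/local/cuda/lib64/${lib}.a"
done
```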
In Python 3.9, the original file size of `_swigfaiss.cpython-39-x86_64-linux-gnu.so` was 341 MB; applying nvprune to all the static libraries brings it down to 310 MB. This is still huge.
The major problem is that CUDA 11.0 splits the cublasLt API into a separate static library, which seems to significantly increase the final binary size. In CUDA 10.x, the cublasLt API was contained in the single cublas static library.
```
libcublasLt_static.a  224M
libcublas_static.a     82M
libcudart_static.a    910K
libculibos.a           31K
```
Strangely, faiss does not use the cublasLt API. But when omitting `-lcublasLt_static` from the linker flags in setup.py, we see the following error on import. Why does that happen?
```
ImportError: /workspace/faiss-wheels/build/lib.linux-x86_64-3.9/faiss/_swigfaiss.cpython-39-x86_64-linux-gnu.so: undefined symbol: cublasLtMatrixTransformDescDestroy
```
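Presumably the reference comes from inside cuBLAS itself rather than from faiss. One way to check (a sketch, assuming binutils' `nm` is available) is to list the undefined symbols recorded in the archive:

```bash
# If libcublas_static.a itself contains undefined (U) references to cublasLt
# entry points, the cublasLt archive must stay on the link line even though
# faiss never calls the cublasLt API directly.
nm /usr/local/cuda/lib64/libcublas_static.a 2>/dev/null \
    | grep ' U cublasLt' | sort -u | head
```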
OK, changing the order of the linker flags in setup.py seems to reduce the binary size.
With CUDA 11.6, the resulting wheel grows further, to 345 MB on Linux. After nvprune, we get 276 MB. This is still not good, as the PyPI default limit is 60 MB.
An alternative is to give up static linking and rely on dynamic linking. This would significantly reduce the wheel size, but it requires users to install the CUDA runtime libraries separately.
With the avx2 extension, the package is ~430 MB.
It seems there are CUDA runtime packages on PyPI: https://pypi.org/project/nvidia-cuda-runtime-cu11/
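If dynamic linking were adopted, the runtime libraries could in principle be installed from those packages. Hypothetically, installation and library discovery might look like this (the `nvidia/*/lib` layout inside the packages is an assumption to verify):

```bash
# Install the CUDA runtime and cuBLAS shared libraries from PyPI.
pip install nvidia-cuda-runtime-cu11 nvidia-cublas-cu11

# The shared objects land inside site-packages under nvidia/*/lib; one option
# is to point the dynamic loader at them explicitly.
SITE=$(python -c 'import site; print(site.getsitepackages()[0])')
export LD_LIBRARY_PATH="${SITE}/nvidia/cuda_runtime/lib:${SITE}/nvidia/cublas/lib:${LD_LIBRARY_PATH}"
```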
Hi!
Did you consider placing the package on a GitLab PyPI index, or publishing it to Docker Hub as an image?
Ping me if you need help.
@theLastOfCats You can manually download packages from the release page.
Hi @kyamagu!
For your reference, by switching from static to dynamic linking of CUDA, the wheel size has been reduced to 63 MB. The wheel is dynamically linked against the shared libraries of the nvidia-cublas-cu12 and nvidia-cuda-runtime-cu12 packages, which are published on PyPI.
It seems possible to reduce the wheel size to less than 60 MB by either narrowing down the target architectures or switching from static to dynamic linking of OpenBLAS.
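One way to confirm the wheel really resolves CUDA from the PyPI-provided packages rather than from bundled copies (a sketch; the module filename is illustrative and may differ in the fork):

```bash
# Locate the extension module inside the installed faiss package and list
# which CUDA shared objects it resolves; the paths should point into
# site-packages/nvidia/... rather than to copies vendored inside the wheel.
SO=$(python -c 'import faiss, glob, os; print(glob.glob(os.path.join(os.path.dirname(faiss.__file__), "_swigfaiss*.so"))[0])')
ldd "${SO}" | grep -E 'libcublas|libcudart'
```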
Fork repository: https://github.com/Di-Is/faiss-wheels/tree/pypi-cuda
Build script:
```bash
# Test CMD
CPU_TEST_CMD="pytest {project}/faiss/tests && pytest -s {project}/faiss/tests/torch_test_contrib.py"
GPU_TEST_CMD="cp {project}/faiss/tests/common_faiss_tests.py {project}/faiss/faiss/gpu/test/ && pytest {project}/faiss/faiss/gpu/test/test_*.py && pytest {project}/faiss/faiss/gpu/test/torch_*.py"

# Common Setup
export CIBW_BEFORE_ALL="bash scripts/build_Linux.sh"
export CIBW_TEST_COMMAND="${CPU_TEST_CMD}"
export CIBW_BEFORE_TEST_LINUX="pip install torch --index-url https://download.pytorch.org/whl/cpu"
export CIBW_ENVIRONMENT_LINUX="FAISS_OPT_LEVEL=${FAISS_OPT_LEVEL:-generic} BUILD_PARALLELISM=${BUILD_PARALLELISM:-3} CUDA_VERSION=12.1"
export CIBW_DEBUG_KEEP_CONTAINER=TRUE

if [ "$FAISS_ENABLE_GPU" = "ON" ]; then
    if [ "$CONTAINER_GPU_ACCESS" = "ON" ]; then
        export CIBW_TEST_COMMAND="${CIBW_TEST_COMMAND} && ${GPU_TEST_CMD}"
        export CIBW_CONTAINER_ENGINE="docker; create_args: --gpus all"
        export -n CIBW_BEFORE_TEST_LINUX
    fi
    export CIBW_ENVIRONMENT_LINUX="${CIBW_ENVIRONMENT_LINUX} FAISS_ENABLE_GPU=ON"
    export CIBW_REPAIR_WHEEL_COMMAND="auditwheel repair -w {dest_dir} {wheel} --exclude libcublas.so.12 --exclude libcublasLt.so.12 --exclude libcudart.so.12"
else
    export CIBW_ENVIRONMENT_LINUX="${CIBW_ENVIRONMENT_LINUX} FAISS_ENABLE_GPU=OFF"
    export CIBW_REPAIR_WHEEL_COMMAND="auditwheel repair -w {dest_dir} {wheel}"
fi

python3 -m cibuildwheel --output-dir wheelhouse --platform linux
```
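For reference, invoking the script above for a GPU build with in-container tests might look like this (assuming the snippet is saved as scripts/build_wheel.sh; the filename is illustrative):

```bash
# GPU wheel build, running the GPU test suite inside the container.
FAISS_ENABLE_GPU=ON CONTAINER_GPU_ACCESS=ON FAISS_OPT_LEVEL=avx2 \
    bash scripts/build_wheel.sh

# CPU-only build with the defaults.
bash scripts/build_wheel.sh
```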
@Di-Is CUDA backward compatibility is complicated, and a PyPI release should not expect any external dependencies other than the few linked to the CPython binary. https://github.com/pypa/manylinux
You can build a source package for your own environment, but that wheel will not be compatible with other environments.
Relevant thread: https://discuss.python.org/t/what-to-do-about-gpus-and-the-built-distributions-that-support-them/7125/58
I believe that installing the appropriate NVIDIA drivers is not a matter of package management but rather part of system setup, and the responsibility for it lies with the user. (This is also true for other package managers, e.g., Conda.) Fortunately, installing the latest driver will work with any version of CUDA and the binaries linked against it.
> a PyPI release should not expect any external dependencies other than the few linked to the CPython binary.
It is correct that wheel files should be self-contained. However, this matter has been discussed in auditwheel PR #368, and a feature to relax the restriction has been merged into auditwheel.
> You can build a source package for your own environment, but that wheel will not be compatible with other environments.
If the following conditions are met, Faiss installed from the created wheel should work properly:
- Run Faiss in an environment with an NVIDIA driver installed that is compatible with the CUDA version in use (a quick check is sketched below).
- Do not load multiple versions of the CUDA shared libraries in a single process (to avoid troublesome issues like symbol conflicts).
Regarding 1: as mentioned earlier, it is the user's responsibility. Regarding 2: the system/package configuration should be reviewed, I believe.
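For condition 1 above, the installed driver can be checked up front, e.g.:

```bash
# Print the installed driver version; the wheel's CUDA 12.x libraries need a
# driver new enough for CUDA 12.
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The nvidia-smi banner also reports the highest CUDA version the driver supports.
nvidia-smi | head -n 4
```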
@Di-Is
> However, regarding this matter, it has been discussed in an auditwheel issue https://github.com/pypa/auditwheel/pull/368#issuecomment-1274911692, and a feature to relax the restrictions has been merged into auditwheel.
This is not a matter of auditwheel but of more fundamental issues in Python dependency management. Under the current PyPI policy, managing GPU dependencies is hard unless there is a standardized toolchain to build and test wheels for all combinations of compiler / CUDA / driver / CPU arch / OS / Python versions, and, recently, compatibility with other packages like PyTorch. At the least, the current PyPI distribution model is not designed for multiple CUDA runtimes. If we ignore that and ship wheels for a very specific runtime configuration, we will end up with a flood of error reports both here and upstream, which is obviously not a good thing. Conda differs from PyPI in that conda does manage runtime environments (e.g., CUDA).
My current approach is to at least keep the source distribution working with any custom environment. Right now, I can't spend time on the GPU binary distribution, but you are welcome to try designing a build and test matrix that resolves the issues across the above configurations.