bitsandbytes
BnB is a disk space hog
System Info
Somehow BnB likes to bring along libraries for all possible CUDA versions. That makes it the largest library in my env after torch, at 300+ MB of disk use (in each env!). Is this really necessary? Is there a magic install parameter to avoid this?
Below are the largest files in the bnb folder in my env:
ncdu 1.12
--- /home/****/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes --------------
Total disk usage: 324.4 MiB Apparent size: 324.2 MiB Items: 110
25.3 MiB [##########] libbitsandbytes_cuda118_nocublaslt.so
24.6 MiB [######### ] libbitsandbytes_cuda123_nocublaslt.so
24.6 MiB [######### ] libbitsandbytes_cuda122_nocublaslt.so
24.5 MiB [######### ] libbitsandbytes_cuda121_nocublaslt.so
24.5 MiB [######### ] libbitsandbytes_cuda120_nocublaslt.so
20.0 MiB [####### ] libbitsandbytes_cuda114_nocublaslt.so
20.0 MiB [####### ] libbitsandbytes_cuda115_nocublaslt.so
19.8 MiB [####### ] libbitsandbytes_cuda117_nocublaslt.so
19.3 MiB [####### ] libbitsandbytes_cuda111_nocublaslt.so
14.2 MiB [##### ] libbitsandbytes_cuda118.so
13.9 MiB [##### ] libbitsandbytes_cuda123.so
13.9 MiB [##### ] libbitsandbytes_cuda122.so
13.8 MiB [##### ] libbitsandbytes_cuda121.so
13.8 MiB [##### ] libbitsandbytes_cuda120.so
10.6 MiB [#### ] libbitsandbytes_cuda110_nocublaslt.so
8.9 MiB [### ] libbitsandbytes_cuda114.so
8.9 MiB [### ] libbitsandbytes_cuda115.so
8.7 MiB [### ] libbitsandbytes_cuda117.so
8.6 MiB [### ] libbitsandbytes_cuda111.so
5.7 MiB [## ] libbitsandbytes_cuda110.so
Also compare it with the GPTQ libs:
$ du -h ~/conda/envs/py38/lib/python3.8/site-packages/auto_gptq -s
832K /home/optimus/conda/envs/py38/lib/python3.8/site-packages/auto_gptq
$ du -h ~/conda/envs/py38/lib/python3.8/site-packages/optimum -s
3.4M /home/optimus/conda/envs/py38/lib/python3.8/site-packages/optimum
$ du -h ~/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes -s
325M /home/optimus/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes
Reproduction
Install bnb with pip and check disk usage.
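For example (paths are illustrative and match the env shown above):
$ pip install bitsandbytes
$ du -h ~/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes -s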
Expected behavior
Taking up much less disk space.
I have a draft PR, #1103, under consideration to help slim this down, but it still needs testing and validation.
There's also some discussion on this here: https://github.com/TimDettmers/bitsandbytes/issues/1032#issuecomment-1927620339
As of the latest v0.43.0 release, we reduced the shipped binaries to cover only CUDA 11.7 - 12.3, but there's more work to be done.
Yeah, we're working on slimming this down, but there's a clear trade-off between ease of installation and disk space. Two main factors add to the volume: CUDA version support, and the binaries being "fat binaries", i.e. each binary for each CUDA version is much "fatter" because it includes the symbols for all compute capabilities. Neither the CUDA version nor the compute capability is detected by pip (please correct me if I'm wrong), so we can't package different wheels while still enabling a simple pip install bitsandbytes.
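For anyone unsure which values apply to their own machine, a couple of standard checks (shown here purely for illustration, using a local CUDA toolkit and PyTorch install):
# CUDA toolkit version, if a local toolkit is installed
$ nvcc --version
# CUDA version your PyTorch build was compiled against
$ python -c 'import torch; print(torch.version.cuda)'
# Compute capability of the first visible GPU, e.g. (8, 6) for sm_86
$ python -c 'import torch; print(torch.cuda.get_device_capability(0))'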
With Conda at least the detection of the CUDA installation seems possible, but this is quite the rabbit hole and tricky. We might look into that later.
Anyway, when compiling from source you can pass CLI args to CMake and specify just the CUDA version and compute capability that you need for your installation and GPU model. This will give you a very reasonably sized binary.
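As a rough sketch only (the exact CMake option names may differ from the current CMakeLists; the compile-from-source docs are authoritative), a single-architecture build might look like this, with the CUDA version taken from whichever toolkit is on your PATH/CUDA_HOME:
$ git clone https://github.com/TimDettmers/bitsandbytes && cd bitsandbytes
# Build only the CUDA backend, restricted to one compute capability (e.g. 8.6)
$ cmake -B build -S . -DCOMPUTE_BACKEND=cuda -DCOMPUTE_CAPABILITY="86"
$ cmake --build build
$ pip install .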
Another factor is that higher compiler optimization levels produce larger binaries, partly due to inlining. As a trade-off, we already chose only the second-highest optimization setting.
#1103, which @matthewdouglas mentioned, tries to simplify things by ensuring we only need to compile once per major CUDA version, which would potentially slim things down to only two binaries. This still needs thorough review and testing, though. Hopefully it will be ready for the next release or the one thereafter.
Anyone reading this, please let us know if you have any info that's not already mentioned here that could help us improve the status quo.
Hmm, I wonder if this really needs to remain an open issue or if we could move this discussion to #1032 or to a thread in the GitHub Discussions dev corner (I can convert the issue to one). Wdyt?
As a temporary measure, is it safe for a user to manually delete all non-relevant versions from site-packages/bitsandbytes?
Like this (to keep only the CUDA 12.1 binaries):
cd ~/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes
find . -type f | grep -e libbitsandbytes_cuda | grep -v 121 | xargs rm
@poedator Yes, that should be safe to do.
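If you want to double-check the result, listing the remaining binaries (paths as in your example above) should show just the two CUDA 12.1 libraries, i.e. roughly 40 MiB of shared objects instead of 300+:
$ ls ~/conda/envs/py38/lib/python3.8/site-packages/bitsandbytes | grep libbitsandbytes_cuda
libbitsandbytes_cuda121.so
libbitsandbytes_cuda121_nocublaslt.so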
As @Titus-von-Koeller mentions, each compute capability that is included adds weight as we're shipping fat binaries compiled for >=Maxwell. Each of these seems to add ~2-3MB to the overall size.
Here's what was shipped in v0.43.0:
CUDA | Targets |
---|---|
11.7.1 | sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, compute_86 |
11.8.0 | sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, compute_89 |
12.0.1 | sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, sm_90, compute_90 |
12.1.1 | sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, sm_90, compute_90 |
12.2.2 | sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, sm_90, compute_90 |
12.3.2 | sm_50, sm_52, sm_60, sm_61, sm_70, sm_75, sm_80, sm_86, sm_89, sm_90, compute_90 |
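For anyone curious which architectures a given binary from the table above actually embeds, NVIDIA's cuobjdump tool can list the cubins and PTX inside a shared library (illustrative invocation; requires the CUDA toolkit):
$ cuobjdump --list-elf libbitsandbytes_cuda121.so
$ cuobjdump --list-ptx libbitsandbytes_cuda121.so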
In https://github.com/TimDettmers/bitsandbytes/issues/1032#issuecomment-1927620339 I had proposed that we drop CUDA < 11.7 and try to align better with PyTorch's binary distributions. CUDA 11.7, 11.8, and 12.1 match their distributions for torch>=1.13.0. The other versions are still there, for now, to keep parity with prior bitsandbytes releases.
bitsandbytes is shipped such that the end user shouldn't actually need the whole NVIDIA CUDA Toolkit and compiler toolchain in order to install and run it. PyTorch's binary distributions come with all of the CUDA libraries we need at runtime. The problem has generally been with locating these at runtime; we tend to end up searching for CUDA toolkit installations instead, and that's part of why we have 12.0, 12.2, and 12.3 in the distribution. Colab, for example, has CUDA 12.2 installed now.
I'm looking at #1126 to make sure we try to load the libraries that come with PyTorch first, before falling back to searching for CUDA libraries elsewhere. The point there, again, is that we want it to be much easier to install and to have broad compatibility across platforms and hardware, so there's a balancing act. But if we get that right, it should mean we can drop down to just those CUDA versions shipped with PyTorch and require the others to be built from source. That potentially shaves half of the binaries away.
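For reference, you can see what a PyTorch wheel already bundles; this is only an illustrative check (file names vary by platform and PyTorch version), not what the #1126 loader actually does:
# Directory holding PyTorch's bundled shared libraries
$ python -c 'import os, torch; print(os.path.join(os.path.dirname(torch.__file__), "lib"))'
# CUDA runtime / cuBLAS libraries shipped in that directory (Linux wheels)
$ ls $(python -c 'import os, torch; print(os.path.join(os.path.dirname(torch.__file__), "lib"))') | grep -E 'cudart|cublas'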
Moving forward there are more options to explore, including:
- #1103 would drop all of the _nocublaslt variants and rely on runtime dispatch.
- Since CUDA 11.1+, there is supposed to be binary forward compatibility with minor toolkit versions. We need to test this out, but if everything works as expected, we could actually ship just one binary for 11.x and one for 12.x. The minimum driver version required is now constant across major releases of the toolkit.
- We could consider slimming down the number of architectures we compile cubins for. For example, both sm_50 and sm_52 are Maxwell. Strictly speaking, an sm_50 cubin will still run on an sm_52 device. The same is the case for Pascal with sm_60 and sm_61. For Turing and newer we definitely want to build optimized cubins for each target, but it may be worth considering a change for Maxwell. In fact, sm_50 support was marked deprecated in the CUDA 11.0 release, which was nearly 4 years ago.
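On the last two points, a couple of quick local checks can help when deciding what a slimmed-down build actually needs to cover (illustrative commands; the compute_cap query field needs a reasonably recent driver):
# Highest CUDA runtime version the installed driver supports (relevant to minor-version compatibility)
$ nvidia-smi | head -n 4
# The architecture each visible GPU actually needs a cubin for
$ nvidia-smi --query-gpu=name,compute_cap --format=csv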
@poedator Yes, that should be safe to do. Agreed.
Thanks @matthewdouglas for elaborating on all your current and upcoming work. Spelling out the details really helps our shared understanding and gets other knowledgeable people involved in fleshing out the tricky details! I'll engage on those topics with you more soon, once I have some other more urgent stuff out of the way.
@poedator on the basis of the above discussion, where it looks like folks are moving the codebase in the right direction and have clarified that manual deletion is fine, are you happy for this issue to be closed?
(I'm just trying to nudge down the total live-issue count on the basis that will improve contributor focus and bandwidth.)
OK to close if this will get worked on in #1032
Yes, I'll keep your feedback in mind when addressing these topics in the coming weeks/months and we'll try to come up with a solution that's more space-saving.
Thanks everyone for your collaborative spirit. Really appreciated.