
[RFC] Cross-Platform Refactor: CPU-only implementation

Open rickardp opened this issue 1 year ago • 2 comments

Motivation

As we want this library to be portable, the first step is to make 100% of it run correctly on CPU only (i.e. not requiring CUDA for any part of the functionality). This would serve two purposes:

  • Provide a baseline that contributors of ports can reference
  • Provide a fallback for partially implemented hardware platforms

Proposed solution

  • [ ] Implement all the CUDA kernels in "normal" C++
  • [ ] Make sure the unit tests all run on the CPU as well
  • [ ] Make sure unit test coverage is satisfactory

Open questions

  • Which CPU architectures do we support (x86_64 and arm64 are givens, but are there others)?
  • How do we deal with SIMD intrinsics? Build separate libraries for each SIMD architecture? Or run-time selection based on CPU features?

@Titus-von-Koeller Feel free to edit this issue as you see fit, for example if you want a different structure for it.

rickardp avatar Feb 03 '24 09:02 rickardp

@rickardp Where are we on this feature? Is some part of it already working, or are there other threads discussing it? There's not much comment here.

I'm especially interested in arm64 CPU-only support.

simepy avatar Sep 06 '24 21:09 simepy

> @rickardp Where are we on this feature? Is some part of it already working, or are there other threads discussing it?

Hi @simepy, sorry, not much to add here still. I'm still up for contributing to this when 1) I have time to do so and 2) the dependencies I don't have time to contribute myself are ready to use. More specifically, the idea is to take a gradual approach and use the reference implementation wherever MPS acceleration is not yet implemented. Currently, large parts of this codebase require CUDA, which does not run on Apple silicon, so a partial implementation is virtually unusable.

rickardp avatar Sep 10 '24 13:09 rickardp

We will be shipping a baseline CPU implementation. This was partially shipped in v0.46.0 and will be expanded in v0.47.0. However, this implementation is primarily PyTorch code rather than C++. In certain situations we may take advantage of torch.compile, or of kernels implemented in the IPEX library. We can also evaluate again in the future whether certain ops are worth the effort of a dedicated C++ kernel.

We've got good CI coverage at this point for both x86-64 and aarch64 CPUs. The main gap is the 8-bit optimizers, which are not currently implemented for CPU; there is an open issue for CPU support of AdamW8bit in #1226. I feel we're now at a satisfactory point, so I'm closing this issue; we can take a pointed approach in separate issues to close any remaining feature gaps.

As a side note, it may be possible to build and run on other platforms such as ppc64le, though we do not intend to provide official support for architectures beyond x86-64 and aarch64 at this point in time.

matthewdouglas avatar Jun 09 '25 18:06 matthewdouglas