Enable common device abstraction for 8bits/4bits

Open jianan-gu opened this issue 8 months ago • 47 comments

As one of the plans in the RFC https://github.com/TimDettmers/bitsandbytes/issues/894, this PR intends to add a device abstraction that allows support for non-CUDA devices (for the 8-bit/4-bit functionality) to be added to bitsandbytes in a more common and straightforward way.

To create a device backend abstraction, this PR adds a common backend folder, which contains the key kernel interface class that each backend implements, a backend registration interface for adding new device backends, and a kernel dispatching mechanism.

Key interfaces from bitsandbytes.functional that are in scope for 8-bit and 4-bit:
| F.igemmlt | F.double_quant | F.mm_dequant | F.transform | F.extract_outliers | F.quantize_4bit | F.dequantize_4bit |

These common backends will then be used in bitsandbytes.functional to implement the 8-bit/4-bit functionality.
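
For illustration, here is a minimal sketch of what such registration-plus-dispatch could look like; the registry layout and the get_backend helper are illustrative assumptions, not necessarily the exact code in this PR:

import torch

# Hypothetical registry: device type string (e.g. "cuda", "cpu", "xpu") -> backend instance
backends = {}

def get_backend(A: torch.Tensor):
    # Pick the backend based on the device of the input tensor.
    backend = backends.get(A.device.type)
    if backend is None:
        raise RuntimeError(f"No bitsandbytes backend registered for device '{A.device.type}'")
    return backend

def quantize_4bit(A: torch.Tensor, *args, **kwargs):
    # bitsandbytes.functional would forward to the selected backend instead of
    # calling CUDA kernels directly.
    return get_backend(A).quantize_4bit(A, *args, **kwargs)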

Currently, the only registered backend is CUDA, matching the behavior before this PR.

jianan-gu avatar Dec 05 '23 06:12 jianan-gu

@yao-matrix @jiqing-feng

jgong5 avatar Dec 05 '23 07:12 jgong5

Hi @TimDettmers, could you please review this PR and share any comments or suggestions? Thanks!

jianan-gu avatar Dec 05 '23 07:12 jianan-gu

Thanks for your contribution! We'll look into it and get back to you soon.

Titus-von-Koeller avatar Dec 05 '23 11:12 Titus-von-Koeller

Thank you so much for this contribution. We discussed internally how to best integrate this and other libraries. We think it is best to abstract the extern C interface so that none (or very little) of the Python code needs to be changed. This has the advantage that no new tests would need to be written to test the functionality of other devices.

In the next weeks, we will work on splitting the extern C "god interface" into more manageable sub-interfaces for (1) 4-bit, (2) 8-bit, (3) 8-bit optimizers. Each vendor can then implement this specific extern C interface to implement the desired sub-functionality.

What do you think about this approach? @rickardp, @abhilash1910, and @arlo-phoenix: this is also relevant for your integration. Please give us feedback on the extern C approach so we can work out the details

TimDettmers avatar Jan 02 '24 06:01 TimDettmers

> Thank you so much for this contribution. We discussed internally how to best integrate this and other libraries. We think it is best to abstract the extern C interface so that none (or very little) of the Python code needs to be changed. This has the advantage that no new tests would need to be written to test the functionality of other devices.
>
> In the next weeks, we will work on splitting the extern C "god interface" into more manageable sub-interfaces for (1) 4-bit, (2) 8-bit, (3) 8-bit optimizers. Each vendor can then implement this specific extern C interface to implement the desired sub-functionality.
>
> What do you think about this approach? @rickardp, @abhilash1910, and @arlo-phoenix: this is also relevant for your integration. Please give us feedback on the extern C approach so we can work out the details

That sounds like a good approach to me. The only thing I remember is that there is currently some specific code around device selection. Not sure if it is still needed or not.

rickardp avatar Jan 02 '24 07:01 rickardp

> What do you think about this approach? @rickardp, @abhilash1910, and @arlo-phoenix: this is also relevant for your integration. Please give us feedback on the extern C approach so we can work out the details

Sounds good. I mostly lean on the CUDA implementation anyway, and this wouldn't affect the preprocessor solution. It would also help if we separate out HIPDevice after all, since only 4-bit has issues with wave64 and we could put that on hold. I'll write more on the AMD PR.

arlo-phoenix avatar Jan 02 '24 12:01 arlo-phoenix

> Thank you so much for this contribution. We discussed internally how to best integrate this and other libraries. We think it is best to abstract the extern C interface so that none (or very little) of the Python code needs to be changed. This has the advantage that no new tests would need to be written to test the functionality of other devices.
>
> In the next weeks, we will work on splitting the extern C "god interface" into more manageable sub-interfaces for (1) 4-bit, (2) 8-bit, (3) 8-bit optimizers. Each vendor can then implement this specific extern C interface to implement the desired sub-functionality.
>
> What do you think about this approach? @rickardp, @abhilash1910, and @arlo-phoenix: this is also relevant for your integration. Please give us feedback on the extern C approach so we can work out the details

Using `extern C` would definitely help reduce custom code on the Python side. I think that this way only the kernels need to be added and compiled, without having to change much of the Python code. The only doubt I have is about Pythonic device selection: would this imply no longer using conditions?

abhilash1910 avatar Jan 02 '24 13:01 abhilash1910

> Please give us feedback on the extern C approach so we can work out the details

@TimDettmers Is integration via the C API the only option for backend integration? Would there also be the flexibility to integrate at the Python level, as proposed in the RFC https://github.com/TimDettmers/bitsandbytes/issues/894? The benefit is that such an integration is lightweight and can leverage existing offerings from other Python acceleration libraries (e.g., Intel Extension for PyTorch) and the PyTorch compilation stack.

jgong5 avatar Jan 03 '24 01:01 jgong5

> Thank you so much for this contribution. We discussed internally how to best integrate this and other libraries. We think it is best to abstract the extern C interface so that none (or very little) of the Python code needs to be changed. This has the advantage that no new tests would need to be written to test the functionality of other devices.
>
> In the next weeks, we will work on splitting the extern C "god interface" into more manageable sub-interfaces for (1) 4-bit, (2) 8-bit, (3) 8-bit optimizers. Each vendor can then implement this specific extern C interface to implement the desired sub-functionality.
>
> What do you think about this approach? @rickardp, @abhilash1910, and @arlo-phoenix: this is also relevant for your integration. Please give us feedback on the extern C approach so we can work out the details

Hi @TimDettmers,

Does the extern C interface you mentioned handle function abstraction across different devices, as this PR does?

Especially for device selection and some device-specific code (as mentioned above by others), it would be more flexible to handle this on the Python side, since different devices may have different requirements for calling into the implementations (e.g., for double_quant, the CPU does not require a specific layout while CUDA may).

This PR focuses on enabling common device selection and dispatching on the Python side, and it leaves room for different devices to implement their key backend functions (which can sit on top of any further extern C interface). Different backends also get the chance to share common functions while keeping their own unique implementations (regardless of whether those implementations come from an extern C library or just a Python library).

jianan-gu avatar Jan 10 '24 09:01 jianan-gu

Just to give you a heads-up about the timeline and logistics on this issue: I am interviewing for academic positions in the next two months and will only sparingly be able to contribute to this discussion. @Titus-von-Koeller and @younesbelkada are working on this, but they also have other responsibilities, and progress is likely to be a little slower from our side. What would be helpful is if you can draft solutions; we will then be able to have a look so we can make progress over time.

As for the remaining questions: Yes, we will need some device abstractions in Python for selection/detection of the device, and these might be different for different devices. The goal would be to keep them as minimal as possible.

> @TimDettmers Is integration via the C API the only option for backend integration? Would there also be the flexibility to integrate at the Python level, as proposed in the RFC https://github.com/TimDettmers/bitsandbytes/issues/894?

What is also possible is to "mock" the extern C API. This means you can write your own wrapper around your own device code in Python, but it should be abstracted to mimic the calling of the extern C API. For example, lib.cdequantize_blockwise_fp32_nf4 calls the extern C device function for NF4 dequantization with 32-bit input tensors. It expects the following C arguments: NULL, ptrA, ptrAbsmax, ptrOut, blocksize, n.

So what you can do is create a lib class where lib.cdequantize_blockwise_fp32_nf4 calls custom Python code that then calls your device code, but the name of the function and the interface of lib.cdequantize_blockwise_fp32_nf4 need to stay the same.

So, if you build a Python abstraction, it is expected that you also build a library class that can call such functions and execute the right code. This will ensure that all written tests can be immediately executed on your device and that all devices call the function with the same name.
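
As a rough illustration of this mocking idea (a sketch only: the MockDeviceLib class and the _device_dequantize_nf4 placeholder below are hypothetical, not actual bitsandbytes code):

def _device_dequantize_nf4(ptrA, ptrAbsmax, ptrOut, blocksize, n):
    # Hypothetical placeholder for the vendor-specific implementation,
    # e.g. a kernel launched through the vendor's own Python extension.
    raise NotImplementedError

class MockDeviceLib:
    """Mimics the extern C symbols so existing Python call sites keep working."""

    def cdequantize_blockwise_fp32_nf4(self, code, ptrA, ptrAbsmax, ptrOut, blocksize, n):
        # Same function name and argument order as the extern C version
        # (NULL, ptrA, ptrAbsmax, ptrOut, blocksize, n); internally we forward
        # to device-specific Python code instead of a compiled library.
        _device_dequantize_nf4(ptrA, ptrAbsmax, ptrOut, blocksize, n)

lib = MockDeviceLib()  # existing code keeps calling lib.cdequantize_blockwise_fp32_nf4(...)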

Please let me know if you have more questions. I will get back to you as soon as I can.

TimDettmers avatar Jan 24 '24 15:01 TimDettmers

> What is also possible is to "mock" the extern C API. This means you can write your own wrapper around your own device code in Python, but it should be abstracted to mimic the calling of the extern C API. For example, lib.cdequantize_blockwise_fp32_nf4 calls the extern C device function for NF4 dequantization with 32-bit input tensors. It expects the following C arguments: NULL, ptrA, ptrAbsmax, ptrOut, blocksize, n.
>
> So what you can do is create a lib class where lib.cdequantize_blockwise_fp32_nf4 calls custom Python code that then calls your device code, but the name of the function and the interface of lib.cdequantize_blockwise_fp32_nf4 need to stay the same.
>
> So, if you build a Python abstraction, it is expected that you also build a library class that can call such functions and execute the right code. This will ensure that all written tests can be immediately executed on your device and that all devices call the function with the same name.
>
> Please let me know if you have more questions. I will get back to you as soon as I can.

Thanks for the feedback, @TimDettmers. The major concern with integrating at these lib.c APIs is that they are at too low a level, working with raw pointers, and may also contain device-backend specifics. The Python-level integration proposed in this PR and in the RFC refers to something at a higher level that works with PyTorch tensors and stays device-agnostic as much as possible. Let me elaborate on these two points:

  1. The high-level API abstraction with PyTorch tensors allows backend implementations to leverage existing PyTorch mechanisms like ATen kernel registration and PyTorch compilation, which also work on PyTorch tensors. Requiring raw pointers in the interface puts constraints on how these APIs can be implemented. So instead of lib.cdequantize_blockwise_fp32_nf4 with NULL, ptrA, ptrAbsmax, ptrOut, blocksize, n, it would be more flexible to have lib.dequantize_blockwise_fp32_nf4 with None, A, Absmax, Out, blocksize, n (see the sketch after this list).
  2. The high-level API abstraction aims to be device-agnostic, while the low-level lib.c API may contain device-backend-specific semantics. For example, lib.cigemm_lt_tuning_32 and lib.ctransform_row2turingT, which are CUDA-specific, would be better abstracted at a higher level as igemm_lt and transform.
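
As a sketch of the first point, a tensor-level signature could look roughly like this (the name and exact parameters are illustrative, not a final API):

from typing import Optional

import torch

def dequantize_blockwise_fp32_nf4(
    code: Optional[torch.Tensor],  # None instead of a NULL pointer
    A: torch.Tensor,               # packed NF4 data as a torch tensor
    absmax: torch.Tensor,          # per-block absmax values
    out: torch.Tensor,             # pre-allocated output tensor
    blocksize: int,
    n: int,
) -> torch.Tensor:
    # A backend could implement this with whatever mechanism fits its device
    # (ATen kernel registration, torch.compile, a native extension, ...).
    raise NotImplementedError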

May I know your concerns about a device abstraction with these high-level APIs? Thanks!

jgong5 avatar Jan 25 '24 06:01 jgong5

Thank you. On second thought, I think your idea of a slightly higher-level device abstraction is better than what we had in mind before. We could lift the interface by one level and, instead of pointers, have torch tensors.

The main concern right now is that I have very limited time, and all the work will fall onto others. With that in mind, I would like to keep the changes required to the NVIDIA GPU functions minimal. I think another issue is that if PRs are too large and need to be managed in bitsandbytes rather than in a binary, then all the maintenance and integration costs will fall on us. Currently, we do not have the personpower to manage a large overhead from our side.

I think a thin wrapper around torch tensors might be the way to go. I think then all the device-specific abstraction can happen in the lib class, and anybody can call a binary that is structured as they choose.

What do others think about this approach? @Titus-von-Koeller @younesbelkada @rickardp, @abhilash1910, @jianan-gu @arlo-phoenix

TimDettmers avatar Jan 25 '24 15:01 TimDettmers

Thanks for the feedback, @TimDettmers.

> The main concern right now is that I have very limited time, and all the work will fall onto others. With that in mind, I would like to keep the changes required to the NVIDIA GPU functions minimal. I think another issue is that if PRs are too large and need to be managed in bitsandbytes rather than in a binary, then all the maintenance and integration costs will fall on us. Currently, we do not have the personpower to manage a large overhead from our side.
>
> I think a thin wrapper around torch tensors might be the way to go. I think then all the device-specific abstraction can happen in the lib class, and anybody can call a binary that is structured as they choose.

We do plan to contribute this device abstraction. Intel can also offer a hand in further maintaining it by collaborating with you and other HW vendors. If the direction looks fine to others, we can start with the abstraction you suggested via the "lib" class and revise this PR accordingly for you and the other HW vendors to review. Does that sound good to you?

jgong5 avatar Jan 26 '24 01:01 jgong5

Agreed, this sounds like the right approach. One benefit is also that it would make it possible to create sub-packages and only rebuild the parts that were actually changed, while keeping a common higher-level API more stable. Maybe it could even trigger different reviewers and so on. Not that any of this has changed, but I suspect that if it is not done this way, the "common" code would be touched quite often, as you write.

> I think another issue is that if PRs are too large and need to be managed in bitsandbytes rather than in a binary

IMHO we are going to need a good test suite as the PRs hopefully start rolling in. Hopefully this can be a community effort, but I guess reviewing them will require a bit more effort from the core maintainers.

rickardp avatar Jan 28 '24 22:01 rickardp

The concept and idea look good here, and this gels with my discussion comment about splitting into multiple Python packages.

I think we should refine the design a bit, though. IMO:

  • Backends shouldn't be a singleton-ish class that exists; just a module-level registry dict is enough and more Pythonic. A register_backend(device_type, backend_instance) function could of course exist (and see below for why instance, not class).
  • There should be an actual Backend abstract base class that describes the interface for all backends with @abstractmethods. The "interfaces are not complete" check would then be useless, as one can't instantiate a class derived from an ABC if it doesn't have concrete implementations of all of the abstract-marked methods.
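
For illustration, a minimal sketch of that shape, with the method list trimmed to two examples (not the full interface):

from abc import ABC, abstractmethod

import torch

class Backend(ABC):
    """Abstract base class that every device backend implements."""

    @abstractmethod
    def quantize_4bit(self, A: torch.Tensor, *args, **kwargs): ...

    @abstractmethod
    def double_quant(self, A: torch.Tensor, *args, **kwargs): ...

# Module-level registry instead of a singleton-ish class.
backends = {}  # device type (e.g. "cuda") -> Backend instance

def register_backend(device_type: str, backend_instance: Backend) -> None:
    # Instantiating a Backend subclass already fails if any @abstractmethod is
    # missing, so no separate "interfaces are not complete" check is needed.
    backends[device_type] = backend_instance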

akx avatar Feb 01 '24 07:02 akx

I agree very much with @akx that an abstract base class seems like a very good approach here. This also makes the interface explicit, which I like. Even if it's not strictly needed in Python, I think it's very good practice. We would like the backends to be gradually/partially implementable; in that sense, the solution approach outlined by @akx is also excellent.

On another note, someone knowledgeable (@albanD from PyTorch) that I discussed this with also mentioned the following:

> On the C++ side, PyTorch had quite a bit of success with registration-based interfaces like this.
>
> That allows the backend to be its own .so with its own implementation, and you can register it however you want (by linking it into your binary, loading it dynamically from Python, etc.).

Titus-von-Koeller avatar Feb 02 '24 05:02 Titus-von-Koeller

On the question of testing

Regarding the topic of testing, based on a question by @jgong5:

> The tests can be conducted at the Python level on the abstracted key interfaces. May I know if there is anything in particular you are concerned about regarding the testing?

We think a core question to discuss is how to run the test suite with different backends, in order to make sure bitsandbytes development can go forward without the risk of hindering future development by creating headwind for new features or for refactoring existing functionality.

In our view, there's a lot of work to be done here, from various sides. I summarized my current understanding in #1031: please review it and let's discuss how we can best solve these challenges together (and whether any considerations are still missing).

The idea is that I update the top-most post in this RFC and the other cross-platform RFCs in order to keep a summary of status, outstanding work and points that still need discussion, based on community input, discussion and agreement.

Thanks everyone for the good work so far on these topics and Intel for their patience while we align with everyone to make this possible!

Titus-von-Koeller avatar Feb 05 '24 19:02 Titus-von-Koeller

Please see #997 for an overview of what discussions and work are ongoing around the cross-platform effort.

Titus-von-Koeller avatar Feb 05 '24 19:02 Titus-von-Koeller

> The concept and idea look good here, and this gels with my discussion comment about splitting into multiple Python packages.
>
> I think we should refine the design a bit, though. IMO:
>
> • Backends shouldn't be a singleton-ish class that exists; just a module-level registry dict is enough and more Pythonic. A register_backend(device_type, backend_instance) function could of course exist (and see below for why instance, not class).
> • There should be an actual Backend abstract base class that describes the interface for all backends with @abstractmethods. The "interfaces are not complete" check would then be useless, as one can't instantiate a class derived from an ABC if it doesn't have concrete implementations of all of the abstract-marked methods.

Hi, @akx

Thanks for your valuable suggestions; the base class does make things clearer in both backend definition and registration. We made the corresponding changes to this PR: (1) define a common abstract base backend class for all device backend classes to implement, and (2) register backends with register_backend(device_type, backend_instance) and remove the unnecessary interface checks.

jianan-gu avatar Feb 06 '24 15:02 jianan-gu

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

github-actions[bot] avatar Feb 07 '24 03:02 github-actions[bot]

Thanks, this is looking better 😄 A handful of comments within to make this more future-proof...

Thanks for your kind and valuable reviews; I have refined this PR accordingly :)

jianan-gu avatar Feb 07 '24 15:02 jianan-gu

Much better, thank you for putting up with my suggestions 😂

We'll need to come back to giving the rest of the backend methods types and docstrings, but more pressingly: I tried to run the tests here and there's a circular import issue now:

$ py.test --assert=plain --durations=20 tests/
[...]
ImportError while importing test module '/home/akx/bitsandbytes/tests/test_autograd.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
/usr/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/test_autograd.py:6: in <module>
    import bitsandbytes as bnb
bitsandbytes/__init__.py:6: in <module>
    from . import device_setup, research, utils
bitsandbytes/research/__init__.py:2: in <module>
    from .autograd._functions import (
bitsandbytes/research/autograd/_functions.py:8: in <module>
    from bitsandbytes.autograd._functions import GlobalOutlierPooler, MatmulLtState
bitsandbytes/autograd/__init__.py:1: in <module>
    from ._functions import get_inverse_transform_indices, undo_layout
bitsandbytes/autograd/_functions.py:10: in <module>
    import bitsandbytes.functional as F
bitsandbytes/functional.py:15: in <module>
    from bitsandbytes.backends import backends
bitsandbytes/backends/__init__.py:14: in <module>
    from .cuda import CUDABackend
bitsandbytes/backends/cuda.py:7: in <module>
    from bitsandbytes.functional import (
E   ImportError: cannot import name 'CUBLAS_Context' from partially initialized module 'bitsandbytes.functional' (most likely due to a circular import) (/home/akx/bitsandbytes/bitsandbytes/functional.py)

This would be fixed by moving the registration of the CUDA backend from bitsandbytes/backends/__init__.py to the very end of bitsandbytes/__init__.py:

if COMPILED_WITH_CUDA:
    from .backends import register_backend
    from .backends.cuda import CUDABackend
    register_backend("cuda", CUDABackend())

I can't run the full test suite successfully anyway on my machine (old consumer-grade graphics card), but this makes the test suite at least start :)

Hi @akx, thanks for checking! I have committed your fix and will run more tests with it. The key issue here was the order of initializing the backend; meanwhile, I will consider whether there is a better way to do this fix.

Also, yes, we need to add the docstrings in the following PRs (which I think may need a dedicated PR for discussion and review).

jianan-gu avatar Feb 08 '24 15:02 jianan-gu

Thanks for your great work @jianan-gu @akx and all! Indeed, it would be great to add nice docstrings to the backend objects, as we can now autogenerate API docs from docstrings thanks to @Titus-von-Koeller, but this can definitely be done in a follow-up PR. @akx, we should probably merge this PR first and then maybe yours (https://github.com/TimDettmers/bitsandbytes/pull/1041) to avoid additional merge conflicts? @jianan-gu, would you be able to run the transformers + bnb tests on your machine (you need access to a machine with a CUDA device)? Just to make sure all will go well once we merge the PR. You need to install bnb from source on this branch, then:

1. git clone https://github.com/huggingface/transformers && cd transformers/
2. RUN_SLOW=1 pytest tests/quantization/bnb/test_4bit.py

Let me know if you need any help! I will let @Titus-von-Koeller take the lead on giving a final review and merging the PR 💪

younesbelkada avatar Feb 08 '24 23:02 younesbelkada

@younesbelkada Sure, I don't mind rebasing #1041 after this gets merged.

akx avatar Feb 09 '24 06:02 akx

> Thanks for your great work @jianan-gu @akx and all! Indeed, it would be great to add nice docstrings to the backend objects, as we can now autogenerate API docs from docstrings thanks to @Titus-von-Koeller, but this can definitely be done in a follow-up PR. @akx, we should probably merge this PR first and then maybe yours (#1041) to avoid additional merge conflicts? @jianan-gu, would you be able to run the transformers + bnb tests on your machine (you need access to a machine with a CUDA device)? Just to make sure all will go well once we merge the PR. You need to install bnb from source on this branch, then:
>
> 1. git clone https://github.com/huggingface/transformers && cd transformers/
> 2. RUN_SLOW=1 pytest tests/quantization/bnb/test_4bit.py
>
> Let me know if you need any help! I will let @Titus-von-Koeller take the lead on giving a final review and merging the PR 💪

Hi @younesbelkada, thanks for your nice advice! :)

For the tests on this PR branch (jianan-gu:upstream_device_abstraction), I compared the results against a base upstream commit (https://github.com/TimDettmers/bitsandbytes/commit/136721a8c1437042f0491972ddc5f35695e5e9b2) as a cross-check, running the following test cases on a CUDA device (nvcc 12.3, A100):

  • Test in BNB repo:
pytest test_*
test_autograd.py              test_functional.py  test_linear4bit.py    test_modules.py  test_triton.py
test_cuda_setup_evaluator.py  test_generation.py  test_linear8bitlt.py  test_optim.py

Results: the base upstream branch and this PR branch have the same pass statuses (test_triton.py failed on both branches due to an issue in my environment, i.e., related to the Triton version).

  • Test in HF repo (https://github.com/huggingface/transformers/commit/2749e479f30ab13235b0b9b4a6bbcf4c3b29a081):
git clone https://github.com/huggingface/transformers && cd transformers/
RUN_SLOW=1 pytest tests/quantization/bnb/test_4bit.py

Results: both base upstream and this PR branch have the same pass rate

jianan-gu avatar Feb 09 '24 10:02 jianan-gu

Thanks everyone! Tim and I will be reviewing the PR at the end of this week and we'll try to get it merged asap.

Titus-von-Koeller avatar Feb 14 '24 18:02 Titus-von-Koeller

I'm still in the process of reviewing, but just wanted to give a quick update. I've run the tests for this PR and also for the merge base 88ab6303; the results are attached as test_output_pr898.log and test_output_merge-base_88ab6303.log. From what I can tell after a detailed review, all tests that are newly failing relative to 88ab6303 fail due to flakiness and the tolerances of the set bounds. I'm looking forward to when we will have a CI with reproducible tests.

I also ran the Transformers/BNB integration test suite with RUN_SLOW=1 pytest tests/quantization/bnb/ and everything was fine.

From that side everything looks good. I also took a glance over the code and the recent changes, and generally everything looks very good. I'll review everything more in depth tomorrow with fresh eyes. Thanks everyone for the good work!

Tim will then have the final say.

Titus-von-Koeller avatar Feb 16 '24 00:02 Titus-von-Koeller

Ok, so I did a thorough review and everything still looks very good. Other than some cosmetic stuff, the one thing that I think is important is to not duplicate that assert code and raise an exception instead, see ensure_backend_is_available. Let me know if you like that approach or if you prefer sth else.

I'm not fully done with my eval, but overall, from my side, we have a green light. I might still want to add a few small refactorings, but I can't finish that today. I'll wrap up first thing on Monday, but IMO this is ready for Tim's review. I'll then run the tests one last time and merge. I also have some docstrings ready. I don't want to commit those changes now, while I don't have a clear mind, but will do so asap on Monday.

For me, I still have two questions that I would like to discuss:

  • Would versioning of that interface make any sense? I wonder if this could help us allow for controlled change over time, but I'm not sure how applicable it is here. Something like this:
class Backend(ABC):
    VERSION = "1.0.0"  # Semantic Versioning
  • Can we improve the naming of the methods in the interface? I know those were simply adapted from Tim's code, but I think we should harmonize those a bit
    • double_quant -> double_quantize as quantize_4bit also uses the verb form.
    • same for mm_dequant
    • igemmlt: not sure what to do with this one; it stands for integer GEMM light

Titus-von-Koeller avatar Feb 17 '24 00:02 Titus-von-Koeller

> Would versioning of that interface make any sense? I wonder if this could help us allow for controlled change over time, but I'm not sure how applicable it is here. Something like this:

Yes, agreed, semantic versioning sounds reasonable, particularly for out-of-tree device backend registration.
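
For illustration, a hedged sketch of how such a VERSION attribute could be checked when an out-of-tree backend registers itself; the packaging dependency and the names here are assumptions, not part of the PR:

from packaging.version import Version  # assumed helper for comparing versions

EXPECTED_MAJOR = 1  # interface major version this bitsandbytes build supports
backends = {}  # device type -> backend instance

def register_backend(device_type, backend_instance):
    version = Version(getattr(backend_instance, "VERSION", "0.0.0"))
    if version.major != EXPECTED_MAJOR:
        raise RuntimeError(
            f"Backend for '{device_type}' implements interface version {version}, "
            f"but this bitsandbytes build expects {EXPECTED_MAJOR}.x"
        )
    backends[device_type] = backend_instance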

jgong5 avatar Feb 17 '24 09:02 jgong5

> Ok, so I did a thorough review and everything still looks very good. Other than some cosmetic stuff, the one thing that I think is important is to not duplicate that assert code and raise an exception instead, see ensure_backend_is_available. Let me know if you like that approach or if you prefer sth else.
>
> I'm not fully done with my eval, but overall, from my side, we have a green light. I might still want to add a few small refactorings, but I can't finish that today. I'll wrap up first thing on Monday, but IMO this is ready for Tim's review. I'll then run the tests one last time and merge. I also have some docstrings ready. I don't want to commit those changes now, while I don't have a clear mind, but will do so asap on Monday.

Thanks for the refactorings! :)

> For me, I still have two questions that I would like to discuss:

> • Would versioning of that interface make any sense? I wonder if this could help us allow for controlled change over time, but I'm not sure how applicable it is here. Something like this:
>
> class Backend(ABC):
>     VERSION = "1.0.0"  # Semantic Versioning

Adding semantic versioning sounds good. Do we version it per device/instance, or globally for all devices/instances? The former seems more flexible, while the latter is cleaner.

> • Can we improve the naming of the methods in the interface? I know those were simply adapted from Tim's code, but I think we should harmonize those a bit
>
>   • double_quant -> double_quantize as quantize_4bit also uses the verb form.
>   • same for mm_dequant
>   • igemmlt: not sure what to do with this one; it stands for integer GEMM light

Yes, double_quant -> double_quantize and mm_dequant -> mm_dequantize do look more straightforward and clear. Do we want to improve the naming in the Backend class and bitsandbytes.functional (and other files) to align them all? Or should we keep the minimal (and cleaner) renaming changes in the Backend class and refactor the other usages later in follow-up PRs?

jianan-gu avatar Feb 17 '24 17:02 jianan-gu