
Feature Request: ROCm support (AMD GPU)

Open gururise opened this issue 2 years ago • 42 comments

Could you please add official AMD ROCm support to this library? An unofficial working port already exists:

https://github.com/broncotc/bitsandbytes-rocm

Thank You

gururise avatar Dec 11 '22 17:12 gururise

Amazing! Thank you for bringing this to my attention. I will try to get in touch with the author of the ROCm library and support AMD GPUs by default.

TimDettmers avatar Feb 02 '23 20:02 TimDettmers

Amazing! Thank you for bringing this to my attention. I will try to get in touch with the author of the ROCm library and support AMD GPUs by default.

That would be AMAZING, especially with you recently adding 8-bit support! I tried to make my own merge of the forks, but I don't really know what I'm doing and don't think I did it correctly.

YellowRoseCx avatar Feb 12 '23 12:02 YellowRoseCx

If the ROCm fork does get merged in, would the Int8 Matmul compatibility improvements also work for AMD GPUs?

anonymous721 avatar Feb 14 '23 04:02 anonymous721

@TimDettmers, curious if AMD support is any nearer to being merged? @agrocylo made a PR (#296) based somewhat on @broncotc's fork...

deftdawg avatar Apr 17 '23 03:04 deftdawg

EDIT: A slightly newer version, branched from v0.37, is available here: https://github.com/Titaniumtown/bitsandbytes-rocm/tree/patch-2

gururise avatar Jun 20 '23 14:06 gururise

The Wikimedia Foundation is really interested in ROCm support too, since Nvidia is not viable for us due to open-source constraints. @TimDettmers, we offer any help (testing/review/etc.) to get this feature merged; it would be really great for the open-source ML ecosystem. Thanks in advance!

elukey avatar Jun 22 '23 12:06 elukey

EDIT: A slightly newer version, branched from v0.37, is available here: https://github.com/Titaniumtown/bitsandbytes-rocm/tree/patch-2

Hi, I'm also seeking an AMD-GPU-compatible version. I tried your patch-2 version, but the code still does not work. The error looks like:

  File "/home/.local/lib/python3.8/site-packages/bitsandbytes/autograd/__init__.py", line 1, in <module>
    from ._functions import undo_layout, get_inverse_transform_indices
  File "/home/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 9, in <module>
    import bitsandbytes.functional as F
  File "/home/.local/lib/python3.8/site-packages/bitsandbytes/functional.py", line 17, in <module>
    from .cextension import COMPILED_WITH_CUDA, lib
  File "/home/.local/lib/python3.8/site-packages/bitsandbytes/cextension.py", line 74, in <module>
    raise RuntimeError('''
RuntimeError: 
        CUDA Setup failed despite GPU being available. Inspect the CUDA SETUP outputs above to fix your environment!
        If you cannot find any issues and suspect a bug, please open an issue with detals about your environment:
        https://github.com/TimDettmers/bitsandbytes/issues

I'm using an AMD MI200 card. Do you have any idea what might be wrong? Many thanks.

Aria-K-Alethia avatar Jul 17 '23 09:07 Aria-K-Alethia

Hello, I was wondering how far off ROCm support is. I'm trying to see if my 7900 XTX will be useful in a project of mine. The Llama 2 quick start guide makes use of bitsandbytes, and as far as I know there aren't any alternatives.

PatchouliPatch avatar Oct 08 '23 08:10 PatchouliPatch

Found this ROCm version of bitsandbytes: https://github.com/Lzy17/bitsandbytes-rocm/tree/main

jiagaoxiang avatar Oct 30 '23 23:10 jiagaoxiang

The only ROCm version that worked for me on gfx900 was this one: https://github.com/agrocylo/bitsandbytes-rocm. All the others failed to compile/install (ROCm 5.2).

mauricioscotton avatar Oct 31 '23 21:10 mauricioscotton

For anyone that needs a patch for RDNA3 cards, I created this fork: https://github.com/st1vms/bitsandbytes-rocm-gfx1100

This fork patches the Makefile to target the gfx1100 amdgpu module with the latest ROCm and clang 17, and fixes some HIP include warnings.

It works with an RX 7900 XT and ROCm 5.7 installed (along with torch built for ROCm 5.7).

Anyway, there should be a better way of targeting the correct amdgpu module in the build system...

Edit:

It probably won't work with libraries requiring a version > 0.35.

st1vms avatar Nov 23 '23 20:11 st1vms

@st1vms There is a problem: the BnB version there is 0.35.4, which is rather outdated, and the latest version of PEFT requires bitsandbytes>=0.37.0.

Wintoplay avatar Dec 11 '23 14:12 Wintoplay

@st1vms There is a problem: the BnB version there is 0.35.4, which is rather outdated, and the latest version of PEFT requires bitsandbytes>=0.37.0.

If that fork still works for you, maybe it is OK to just change the version number.

You can test if the library works with:

python -m bitsandbytes

If it does, try editing the version number in the fork's setup.py before building and installing it, i.e. change it to 0.37.0, and see if PEFT works...
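For what it's worth, a rough sketch of that workflow (assuming the fork's setup.py still declares version 0.35.4, as noted above; adjust to whatever the file actually contains):

    # after manually changing version="0.35.4" to version="0.37.0" in the fork's setup.py
    # (build the HIP library first if the fork's README says so, e.g. with its make target)
    pip install .            # rebuild and reinstall the fork
    python -m bitsandbytes   # built-in self-check; should load the ROCm/HIP library without errors
    pip install peft         # the bitsandbytes>=0.37.0 requirement should now be satisfied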

st1vms avatar Dec 11 '23 15:12 st1vms

@st1vms I tried BnB 0.39.0 and the dependencies seem fine. However, I tried to LoRA-finetune according to this notebook: https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing

The Jupyter kernel crashed; reason: undefined.

[screenshot: Screenshot from 2023-12-12 00-11-47]

Wintoplay avatar Dec 11 '23 17:12 Wintoplay

@st1vms I tried BnB 0.39.0 and the dependencies seem fine. However, I tried to LoRA-finetune according to this notebook: https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing

The Jupyter kernel crashed; reason: undefined.

[screenshot: Screenshot from 2023-12-12 00-11-47]

Well, the fork is probably already obsolete for some libraries; you should look for updated ones.

st1vms avatar Dec 11 '23 17:12 st1vms

@st1vms I retried with a new virtual env and changed from .ipynb to .py. This is the result:

(torch3) win@win-MS-7E02:/mnt/1df6b45e-20dc-41ca-9a04-b271fd3a4940/Learn$ /usr/bin/env /home/win/torch3/bin/python /home/win/.vscode-oss/extensions/ms-python.python-2023.20.0-universal/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher 60843 -- /mnt/1df6b45e-20dc-41ca-9a04-b271fd3a4940/Learn/finetune.py
/home/win/torch3/lib/python3.10/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/win/torch3/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
/home/win/torch3/lib/python3.10/site-packages/transformers/utils/generic.py:309: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
Loading checkpoint shards: 100%|██████████████████████████████████| 2/2 [00:14<00:00, 7.21s/it]
trainable params: 8388608 || all params: 6666862592 || trainable%: 0.12582542214183376
  0%|          | 0/200 [00:00<?, ?it/s]
You're using a GPT2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
/home/win/torch3/lib/python3.10/site-packages/torch/utils/checkpoint.py:461: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/home/win/torch3/lib/python3.10/site-packages/bitsandbytes-0.41.0-py3.10.egg/bitsandbytes/autograd/_functions.py:231: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")

=============================================
ERROR: Your GPU does not support Int8 Matmul!

python: /mnt/1df6b45e-20dc-41ca-9a04-b271fd3a4940/bitsandbytes-rocm-gfx1100/csrc/ops.cu:347: int igemmlt(cublasLtHandle_t, int, int, int, const int8_t *, const int8_t *, void *, float *, int, int, int) [FORMATB = 3, DTYPE_OUT = 32, SCALE_ROWS = 0]: Assertion `false' failed.

Wintoplay avatar Dec 11 '23 18:12 Wintoplay

@st1vms I tried BnB 0.39.0 and the dependencies seem fine. However, I tried to LoRA-finetune according to this notebook: https://colab.research.google.com/drive/1jCkpikz0J2o20FBQmYmAGdiKmJGOMo-o?usp=sharing The Jupyter kernel crashed; reason: undefined. [screenshot: Screenshot from 2023-12-12 00-11-47]

Well, the fork is probably obsolete already for some libraries, you should look for updated ones.

Can someone post any updated forks to this thread? The lack of proper BnB support is really holding back AMD cards.

gururise avatar Dec 11 '23 20:12 gururise

Looks like things may finally move forward, with official support in the not-too-distant future! Hopefully with ROCm 6.x we can finally see support merged into this repo.

gururise avatar Dec 20 '23 16:12 gururise

Sorry for taking so long on this. I am currently onboarding more maintainers and we should see some progress on this very soon. This is one of our high-priority issues.

TimDettmers avatar Jan 01 '24 17:01 TimDettmers

Would love to see ROCm support; keep up the good work.

SakshamG7 avatar Jan 06 '24 08:01 SakshamG7

If I may ask, what's the progress so far?

PatchouliPatch avatar Feb 11 '24 13:02 PatchouliPatch

If I may ask, what's the progress so far?

If you haven't already seen it, there was a comment in the discussions with an accompanying tracking issue for general cross-platform support rather than just AMD/ROCm support. It appears that work is currently in the planning phase.

Airradda avatar Feb 11 '24 19:02 Airradda

@TimDettmers @Titus-von-Koeller, we are at ~95% parity for bnb in https://github.com/ROCm/bitsandbytes/tree/rocm_enabled on Instinct-class GPUs, and we are working to close the gaps on Navi. At this point, we should seriously consider upstreaming. Could you drop me an email at [email protected] so we can set up a call to discuss further? cc: @sunway513 @Lzy17 @pnunna93

amathews-amd avatar Mar 23 '24 19:03 amathews-amd

@amathews-amd I tried compiling the ROCm version of BnB from the rocm_enabled branch, but it is failing with errors on an AMD MI250X. Do you have any suggestions for how to resolve the issue?

chauhang avatar Mar 29 '24 07:03 chauhang

@chauhang Could you try with ROCm 6.0? You can use this Docker image - rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2 - and install bitsandbytes directly.
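For anyone following along, a rough sketch of that setup (the device flags and build commands below are assumptions on my part; the rocm_enabled branch README has the authoritative steps):

    docker run -it --device=/dev/kfd --device=/dev/dri --group-add video \
        rocm/pytorch:rocm6.0.2_ubuntu22.04_py3.10_pytorch_2.1.2
    # inside the container:
    git clone -b rocm_enabled https://github.com/ROCm/bitsandbytes.git
    cd bitsandbytes
    pip install .            # build against the ROCm toolchain bundled in the image
    python -m bitsandbytes   # sanity check that the HIP backend loads
    # (the branch may require a separate build step, e.g. a make target, before installing; check its README)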

pnunna93 avatar Mar 29 '24 15:03 pnunna93

@pnunna93 I am already using ROCm 6.0 -- I have added details of the PyTorch environment here.

chauhang avatar Mar 29 '24 17:03 chauhang

@chauhang, you can skip the hipBLASLt update and install bitsandbytes directly then. Please let me know if you face any issues.

pnunna93 avatar Mar 29 '24 18:03 pnunna93

I was using the arlo-phoenix fork: https://github.com/arlo-phoenix/bitsandbytes-rocm-5.6/tree/rocm

Should I use the ROCm fork instead? https://github.com/ROCm/bitsandbytes/tree/rocm_enabled

ehartford avatar Mar 29 '24 18:03 ehartford

Yes, it's updated for ROCm 6.

pnunna93 avatar Mar 29 '24 18:03 pnunna93

@TimDettmers @Titus-von-Koeller, we are at ~95% parity for bnb in https://github.com/ROCm/bitsandbytes/tree/rocm_enabled on Instinct-class GPUs, and we are working to close the gaps on Navi. At this point, we should seriously consider upstreaming. Could you drop me an email at [email protected] so we can set up a call to discuss further? cc: @sunway513 @Lzy17 @pnunna93

I've often had trouble understanding the state of GPU support in ROCm. So with that said, I have some clarification questions:

  • Can we clarify what we mean by "Instinct-class" GPUs?
    • The ROCm 6.0.2 docs suggest to me this is all CDNA, so MI100 and newer? Or is MI50 expected to work also?
  • What is the intention for Navi support?
    • Is this for RDNA2/RDNA3 only?
  • Is there intent to support ROCm < 6?

I'd like to be able to help get this merged, but need to figure out the constraints. The only AMD GPUs that I have on hand (RX 570 and R9 270X) aren't going to cut it.

The other issue is how far behind main this is. Ideally this could be implemented as a separate backend, as proposed in #898. We would want to switch to CMake for building. I also think it would be better to unify the C++/CUDA code with the hipified code and handle most of the differences with conditional compilation.
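To illustrate that last point, a purely hypothetical sketch of what a unified, CMake-driven build with a selectable backend could look like (the option names here are invented for illustration, not existing options in this repo):

    # configure for a HIP/ROCm build instead of CUDA (hypothetical flags)
    cmake -B build -DCOMPUTE_BACKEND=hip -DBNB_ROCM_ARCH="gfx90a;gfx1100"
    cmake --build build
    pip install .            # package the freshly built shared library
    # the same C++/CUDA sources would be hipified at build time, with backend differences
    # handled via conditional compilation (e.g. #ifdef guards) rather than a separate fork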

matthewdouglas avatar Mar 30 '24 00:03 matthewdouglas