Runtime Error: BF16 unsupported on supported hardware
I am using the default lora_finetune_single_device.py recipe and the 2B_qlora_single_device.yaml config, both without modifications, running on an RTX 4090.
Attempting to use tune run lora_finetune_single_device.py --config 2B_qlora_single_device.yaml results in:
RuntimeError: bf16 precision was requested but not available on this hardware. Please use fp32 precision instead.
The environment is set up with mamba and the latest torch, with all requirements met:
torch==2.3.0
torchao==0.1
torchaudio==2.3.0
torchtune==0.1.1
torchvision==0.18.0
I tried running the following:
>>> torch.cuda.is_available()
True
>>> torch.cuda.is_bf16_supported()
True
Also tested on nightly:
torch==2.4.0.dev20240428+cu121
torchtune==0.2.0.dev20240428+cu121
Thanks for filing this issue!
I'm surprised you're running into this issue for two reasons:
- As you pointed out, 4090s support bfloat16
- We relaxed this check for non-CUDA devices in this PR
I just launched a QLoRA training on a 4090 using the nightly build and didn't run into this. Mind checking the following as well?
torch.distributed.is_nccl_available()
torch.cuda.nccl.version() >= (2, 10)
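For convenience, here's a small script (just a sketch collecting the same calls listed above) that prints everything in one go:

```python
import torch
import torch.distributed as dist

# Same checks as above, collected in one script.
print("cuda available :", torch.cuda.is_available())
print("bf16 supported :", torch.cuda.is_bf16_supported())
print("nccl available :", dist.is_available() and dist.is_nccl_available())
if dist.is_available() and dist.is_nccl_available():
    print("nccl >= 2.10   :", torch.cuda.nccl.version() >= (2, 10))
```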
Ahh it seems that might be the issue, though I didn't know that was a requirement:
>>> torch.distributed.is_nccl_available()
False
>>> torch.cuda.nccl.version() >= (2, 10)
AttributeError: module 'torch._C' has no attribute '_nccl_version'
I had assumed I could launch this training on Windows; I should have mentioned that it is the OS I am using. Does this mean torchtune will not work on Windows?
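As an aside, a guarded version of that check (the helper name here is made up) returns False instead of raising on builds that ship without NCCL compiled in:

```python
import torch
import torch.distributed as dist

def nccl_at_least(major: int, minor: int) -> bool:
    # Hypothetical helper: torch.cuda.nccl.version() raises on builds
    # without NCCL (e.g. the Windows wheels), so guard the call.
    if not (dist.is_available() and dist.is_nccl_available()):
        return False
    try:
        return torch.cuda.nccl.version() >= (major, minor)
    except AttributeError:
        return False

print(nccl_at_least(2, 10))  # False on this Windows setup, no AttributeError
```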
@slobodaapl thanks for pointing this out. Right now I think we assume the availability of NCCL in a couple of places. @rohan-varma may know best here: is it sufficient to just point to the Gloo backend for Windows, or is there more we need to do for proper support?
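Concretely, I'm imagining something along these lines (just a sketch of the fallback, not what torchtune does today; it assumes the usual env:// rendezvous variables are set):

```python
import torch.distributed as dist

# Sketch: fall back to the Gloo backend when NCCL isn't available,
# e.g. on Windows builds of PyTorch. Assumes MASTER_ADDR, MASTER_PORT,
# RANK, and WORLD_SIZE are set in the environment.
backend = "nccl" if dist.is_nccl_available() else "gloo"
dist.init_process_group(backend=backend, init_method="env://")
```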
@ebsmothers the Gloo library is currently unmaintained; at one point there was minimal support for Windows, but no one is on the hook for maintaining it in PyTorch core at the moment. @slobodaapl torchtune is currently not tested on Windows; we have only run comprehensive tests and verification on Linux machines so far.
I also hit the same error:
RuntimeError: bf16 precision was requested but not available on this hardware. Please use fp32 precision instead.
- Python Version: 3.11.9
- PyTorch Version: 2.3.0+rocm6.0
- BitsAndBytes Version: 0.43.2.dev
- Torchtune Version: 0.1.1+cpu
- GPU Info: 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 31 [Radeon RX 7900 XT] (rev cc)
- AMD Driver Version: OpenGL version string: 4.6 (Compatibility Profile) Mesa 24.0.8-arch1.1
- OS Description: Arch Linux
- Kernel Version: 6.9.1-arch1-1
torch.cuda.is_bf16_supported() = True
torch.distributed.is_nccl_available() = True
torch.cuda.nccl.version() >= (2, 10) = True
> We relaxed this check for non-CUDA devices in this PR

Applying the changes from that PR fixes the issue for me on the above system.
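My understanding of the relaxed logic is roughly the following (my own sketch, not torchtune's actual code): the CUDA/NCCL requirements are only applied when training on a CUDA device.

```python
import torch
import torch.distributed as dist

def bf16_ready(device: torch.device) -> bool:
    # Sketch only: skip the CUDA/NCCL requirements on non-CUDA devices.
    if device.type != "cuda":
        return True
    return (
        torch.cuda.is_available()
        and torch.cuda.is_bf16_supported()
        and dist.is_available()
        and dist.is_nccl_available()
        and torch.cuda.nccl.version() >= (2, 10)
    )

if not bf16_ready(torch.device("cuda")):
    raise RuntimeError(
        "bf16 precision was requested but not available on this hardware. "
        "Please use fp32 precision instead."
    )
```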
I'm seeing this error pop up a lot here https://discuss.pytorch.org/t/fine-tune-llms-using-torchtune/201804
I think this change is in the nightly, not in the stable package. @Nihilentropy-117 can you try the instructions mentioned here: https://pytorch.org/torchtune/main/install.html. @msaroufim can you help point folks to this?
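As a quick sanity check that the nightly actually got picked up, the installed versions can be printed with importlib:

```python
from importlib.metadata import version

# Nightly builds report versions like "0.2.0.dev20240428+cu121";
# the stable release reports "0.1.1".
print("torchtune:", version("torchtune"))
print("torch:", version("torch"))
```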
A couple of follow-ups for us (cc: @ebsmothers):
- Make the nightly clearer on the README and also highlight which features are not in 0.1.1
- Package push, which we're planning for in a couple of weeks.
Actually I responded to the thread. Thanks for sharing, @msaroufim! I haven't been keeping up with torchtune issues on pytorch discuss. Will do so moving forward.