[HELP] Runnable combination of RTX 5090 GPU + Linux driver version + PyTorch version + DeepSpeed version for LLM fine-tuning
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
I want to fine-tune an LLM (e.g. Llama-3-8B) on an RTX 5090 under Ubuntu 22.04.5 LTS, but it failed.
I used:
Linux driver: https://www.nvidia.com/en-us/drivers/details/240524/
PyTorch version: pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
DeepSpeed version: 0.15.1 (or 0.16.3)
I hope someone can give a possible/runnable combination of Linux driver version + PyTorch version + DeepSpeed version for LLM fine-tuning on an RTX 5090.
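Before hunting for a working driver/framework combination, it helps to check whether the installed PyTorch build was even compiled with Blackwell (sm_120) kernels. A minimal sketch, assuming torch is installed; the `supports_blackwell` helper is my own illustration, not part of any library:

```python
def supports_blackwell(arch_list):
    # sm_120 is the compute capability of the RTX 50xx (Blackwell) series;
    # arch entries may look like "sm_90" or "compute_90", so compare the prefix only.
    return any(arch.split(":")[0] == "sm_120" for arch in arch_list)

# With torch installed you would pass torch.cuda.get_arch_list(), e.g.:
#   import torch
#   print(supports_blackwell(torch.cuda.get_arch_list()))

# Stable 2.6.0+cu124 wheels ship without sm_120:
print(supports_blackwell(["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]))  # False
# Recent cu128 nightly wheels include it:
print(supports_blackwell(["sm_75", "sm_80", "sm_86", "sm_90", "sm_100", "sm_120"]))  # True
```

If this returns False, no driver or DeepSpeed version will help; the torch wheel itself has to be replaced first.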
Reproduction
EXPERIMENT_NAME="Test"
eval "$(/opt/miniconda3/bin/conda shell.bash hook)"
cd /home/test
conda activate finetune_01
export PATH=/usr/local/cuda-12.8/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export WANDB_DISABLED=true
export CUDA_VISIBLE_DEVICES=0
deepspeed --num_gpus 1 --num_nodes 1 /src/train.py \
  --stage sft \
  --model_name_or_path "/home/z890/model/Llama-3-8B" \
  --do_train \
  --dataset "alpaca" \
  --max_length 4096 \
  --finetuning_type lora \
  --output_dir "/home/test/output/" \
  --overwrite_cache \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler_type cosine \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 5000 \
  --learning_rate 4e-4 \
  --num_train_epochs 17 \
  --plot_loss \
  --fp16 \
  --lora_target q_proj,v_proj \
  --lora_r 1024 \
  --lora_alpha 2048 \
  --lora_dropout 0.05 \
  --preprocessing_num_workers 96 \
  --template default \
  --deepspeed ~/deepspeed/deepspeed_stage3.json
Others
No response
Have you tried the docker images?
Why not post the error messages you got? Without them, debugging is impossible.
Although I am not running conda, this may still help someone with similar problems.
LLaMA-Factory + RTX 5090 GPU + Windows 11 WSL2 running Ubuntu 22 LTS
Within the host Windows 11, I first installed the current Game Ready driver (572.42 from Feb 13th, 2025 for me).
Then I have installed Ubuntu 22 LTS as a WSL2 guest.
Now within Ubuntu 22 I installed Python 3.12.8 and then the CUDA Toolkit 12.8 according to https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin && \
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600 && \
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-wsl-ubuntu-12-8-local_12.8.0-1_amd64.deb && \
sudo dpkg -i cuda-repo-wsl-ubuntu-12-8-local_12.8.0-1_amd64.deb && \
sudo cp /var/cuda-repo-wsl-ubuntu-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/ && \
sudo apt-get update && \
sudo apt-get -y install cuda-toolkit-12-8 && \
CUDA_HOME=/usr/local/cuda && \
PATH=${CUDA_HOME}/bin${PATH:+:${PATH}} && \
LD_LIBRARY_PATH=${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}} && \
export LD_LIBRARY_PATH && \
export CUDA_HOME && \
export PATH
# check nvidia-smi:
$ nvidia-smi
Thu Feb 20 12:11:09 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.17 Driver Version: 572.42 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:01:00.0 On | N/A |
| 0% 46C P8 36W / 575W | 2494MiB / 32607MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 66 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
After doing the usual LLaMA-Factory installation routine...
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && cd LLaMA-Factory && \
python3 -m venv llamafactoryenv && source llamafactoryenv/bin/activate && \
pip install -e ".[torch,metrics]"
...I have tried...
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
...but got the following warning message:
llamafactoryenv/lib/python3.12/site-packages/torch/cuda/__init__.py:235: UserWarning: NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90. If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
and thereafter I got the following error messages:
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
These warning and error messages result from an old torch build that is incompatible with the RTX 50xx Blackwell GPU series:
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ python
>>> import torch
>>> print("PyTorch version:", torch.__version__)
PyTorch version: 2.6.0+cu124
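The `+cuXXX` suffix in `torch.__version__` tells you which CUDA toolkit the wheel was built against; any `cu124` build predates sm_120 support. A small sketch for pulling that tag out programmatically (the `cuda_tag` helper name is my own, not a torch API):

```python
def cuda_tag(torch_version):
    # torch wheel versions carry the CUDA build as a local version tag
    # after '+', e.g. "2.6.0+cu124" or "2.7.0.dev20250218+cu128".
    _, sep, local = torch_version.partition("+")
    return local if sep else None

print(cuda_tag("2.6.0+cu124"))              # cu124 -> no sm_120 kernels
print(cuda_tag("2.7.0.dev20250218+cu128"))  # cu128 -> Blackwell-capable nightly
print(cuda_tag("2.6.0"))                    # None  -> CPU-only or source build
```

In a live environment you would pass `torch.__version__` directly.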
Solution: Install new nightly torch torchvision torchaudio as follows:
# ensure you are still within the same activated LLaMA-Factory env!
# and do this AFTER the usual `pip install -e ".[torch,metrics]"`, because otherwise that step would overwrite the nightly build again
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
# verify torch version:
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ python
>>> import torch
>>> print("PyTorch version:", torch.__version__)
PyTorch version: 2.7.0.dev20250218+cu128
# more tests, see https://github.com/jayrodge/NVIDIA-RTX5090-AI-Dev-Setup/blob/main/pytorch_huggingface_rtx5090_setup.ipynb
# thanks to https://www.youtube.com/watch?v=af7XjGekm4g
Now just try again:
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
Output:
[INFO|tokenization_utils_base.py:2500] 2025-02-20 12:45:43,523 >> tokenizer config file saved in saves/llama3-8b/lora/sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2509] 2025-02-20 12:45:43,523 >> Special tokens file saved in saves/llama3-8b/lora/sft/special_tokens_map.json
***** train metrics *****
  epoch                    = 2.9826
  total_flos               = 22730138GF
  train_loss               = 0.9175
  train_runtime            = 0:23:30.99
  train_samples_per_second = 2.32
  train_steps_per_second   = 0.289
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
[WARNING|2025-02-20 12:45:43] llamafactory.extras.ploting:162 >> No metric eval_loss to plot.
[WARNING|2025-02-20 12:45:43] llamafactory.extras.ploting:162 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2025-02-20 12:45:43,823 >> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
Furthermore, you can now run
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
It is probably similar with conda but I have no experience with that.
Nowadays training is much faster than in February 2025. Current performance for the example
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml:
[INFO|tokenization_utils_base.py:2510] 2025-03-31 23:26:19,095 >> tokenizer config file saved in saves/llama3-8b/lora/sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-03-31 23:26:19,096 >> Special tokens file saved in saves/llama3-8b/lora/sft/special_tokens_map.json
***** train metrics *****
  epoch                    = 2.9826
  total_flos               = 22730138GF
  train_loss               = 0.9239
  train_runtime            = 0:13:17.65
  train_samples_per_second = 4.103
  train_steps_per_second   = 0.511
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
[WARNING|2025-03-31 23:26:19] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-03-31 23:26:19] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2025-03-31 23:26:19,254 >> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
versus old ones from https://github.com/hiyouga/LLaMA-Factory/issues/6958#issuecomment-2671295077
***** train metrics *****
  epoch                    = 2.9826
  total_flos               = 22730138GF
  train_loss               = 0.9175
  train_runtime            = 0:23:30.99
  train_samples_per_second = 2.32
  train_steps_per_second   = 0.289
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
[WARNING|2025-02-20 12:45:43] llamafactory.extras.ploting:162 >> No metric eval_loss to plot.
Maybe the performance jump comes from the CPU upgrade (Intel 9900K -> AMD 9800X3D), or from the more recent software:
$ python
Python 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bitsandbytes
>>> import torch
>>> print(bitsandbytes.__version__)
0.45.5.dev0
>>> print(torch.__version__)
2.8.0.dev20250331+cu128

$ nvidia-smi
Mon Mar 31 23:29:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0  On |                  N/A |
|  0%   43C    P8             24W /  575W |    2129MiB /  32607MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Installation after fresh WSL2 with Ubuntu 22 LTS, with current GPU driver 572.83 in Windows:
cd ~ && \
sudo apt-get install -y python3.10-venv python3.10-dev python3.10-full && \
sudo apt-get install -y cmake && \
# CUDA Toolkit installation according to https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin && \
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600 && \
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda-repo-wsl-ubuntu-12-8-local_12.8.1-1_amd64.deb && \
sudo dpkg -i cuda-repo-wsl-ubuntu-12-8-local_12.8.1-1_amd64.deb && \
sudo cp /var/cuda-repo-wsl-ubuntu-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/ && \
sudo apt-get update && \
sudo apt-get -y install cuda-toolkit-12-8 && \
CUDA_HOME=/usr/local/cuda && \
PATH=${CUDA_HOME}/bin${PATH:+:${PATH}} && \
LD_LIBRARY_PATH=${CUDA_HOME}/lib64 ${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}} && \
export LD_LIBRARY_PATH && \
export CUDA_HOME && \
export PATH && \
cd ~ && \
# LLaMa-Factory comes here:
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
cd LLaMA-Factory && \
# pin a known-good commit; newer ones might cause dependency errors
git checkout 2d421c57bf85c043f290b757c1f4d0ffe633f23f && \
python3 -m venv llamafactoryenv && \
source llamafactoryenv/bin/activate && \
# from https://github.com/hiyouga/LLaMA-Factory/?tab=readme-ov-file#installation
pip install -e ".[torch,metrics]" && \
pip install deepspeed bitsandbytes vllm optimum && \
pip install --upgrade --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128 && \
# now we need to build bitsandbytes and triton from source:
cd ~ && git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git && cd bitsandbytes/ && \
cmake -DCOMPUTE_BACKEND=cuda -S . && \
make && \
pip install . && \
cd ~ && git clone https://github.com/triton-lang/triton.git && \
cd triton && \
pip install -r python/requirements.txt && \
pip install -e python && \
# edit: to avoid the error "No module named 'triton.runtime.jit'", you should do this too:
cd python/ && python setup.py install && \
cd ~ && cd LLaMA-Factory && \
# do the examples:
pip install "huggingface_hub[hf_transfer]" && pip install hf_transfer && huggingface-cli login && \
llamafactory-cli train examples/train_qlora/llama3_lora_sft_otfq.yaml
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
# edit: for some workflows you need these installed too:
# quote the version pins so the shell does not treat ">" as a redirection
pip install "optimum>=1.17.0"
pip install "auto_gptq>=0.5.0"
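The version pins above rely on pip's PEP 440 comparison rules. If you want to sanity-check an installed version against such a minimum yourself, here is a simplified pure-Python sketch (numeric segments only, so pre-release suffixes like `.dev20250331` are ignored; the `meets_minimum` helper name is my own):

```python
def meets_minimum(installed, required):
    # Compare dotted versions segment by segment, numerically.
    # Simplification: non-numeric parts (e.g. "dev20250331") are dropped;
    # real pip applies full PEP 440 semantics.
    def parts(v):
        return [int(p) for p in v.split(".") if p.isdigit()]
    return parts(installed) >= parts(required)

print(meets_minimum("1.17.0", "1.17.0"))  # True  (optimum>=1.17.0 satisfied)
print(meets_minimum("0.5.1", "0.5.0"))    # True  (auto_gptq>=0.5.0 satisfied)
print(meets_minimum("1.16.2", "1.17.0"))  # False (too old)
```

For anything serious, use `packaging.version.Version` instead of hand-rolled parsing.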
Further hijacking this issue for general RTX 5060/5070/5080/5090 compatibility info with LLaMA-Factory:
For vLLM inference (https://github.com/hiyouga/LLaMA-Factory?tab=readme-ov-file#deploy-with-openai-style-api-and-vllm -> API_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml), Blackwell support is still lacking. The best you can do is check https://github.com/vllm-project/vllm/issues/14452 and build vLLM yourself for Blackwell. It works; I've tried it. Otherwise we still need to wait for the PyTorch 2.7.0 release (a release candidate appears to be coming soon).
anonymous@Anonymous:~$ nvidia-smi
Mon Apr 21 09:52:28 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 576.02       CUDA Version: 12.9     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 5090        On  | 00000000:01:00.0  On |                  N/A |
| 30%   49C    P0           310W /  600W  |  7705MiB / 32607MiB  |     59%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
anonymous@Anonymous:~$
It works :D with TensorFlow 2.16.1 AND TensorFlow nightly 2.20 / CUDA Toolkit 12.8 and cuDNN 8.9.7 under WSL2 (Windows 11 Pro, Ubuntu 22.04) on the RTX 5090 =D It's a dream, but it works :)
You probably need to bump CUDA from 12.4 to 12.8 to support the Blackwell-series GPUs.