[HELP] Runnable combination of RTX 5090 GPU + Linux driver version + PyTorch version + DeepSpeed version for LLM fine-tuning
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
I want to fine-tune an LLM (e.g. Llama-3-8B) on an RTX 5090 under Ubuntu 22.04.5 LTS, but it failed.
I used:
Linux driver: https://www.nvidia.com/en-us/drivers/details/240524/
PyTorch version: pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
DeepSpeed version: 0.15.1 (or 0.16.3)
I hope someone can give a possible/runnable combination of Linux driver version + PyTorch version + DeepSpeed version for LLM fine-tuning on an RTX 5090.
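Before hunting for a working driver/framework combination, it helps to check whether the installed PyTorch build was even compiled with Blackwell (sm_120) kernels. A minimal sketch, assuming torch is installed; the `supports_blackwell` helper is my own illustration, not part of any library:

```python
def supports_blackwell(arch_list):
    # sm_120 is the compute capability of the RTX 50xx (Blackwell) series;
    # arch entries may look like "sm_90" or "compute_90", so compare the prefix only.
    return any(arch.split(":")[0] == "sm_120" for arch in arch_list)

# With torch installed you would pass torch.cuda.get_arch_list(), e.g.:
#   import torch
#   print(supports_blackwell(torch.cuda.get_arch_list()))

# Stable 2.6.0+cu124 wheels ship without sm_120:
print(supports_blackwell(["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]))  # False
# Recent cu128 nightly wheels include it:
print(supports_blackwell(["sm_75", "sm_80", "sm_86", "sm_90", "sm_100", "sm_120"]))  # True
```

If this returns False, no driver or DeepSpeed version will help; the torch wheel itself has to be replaced first.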
Reproduction
EXPERIMENT_NAME="Test"
eval "$(/opt/miniconda3/bin/conda shell.bash hook)"
cd /home/test
conda activate finetune_01
export PATH=/usr/local/cuda-12.8/bin:${PATH}
export LD_LIBRARY_PATH=/usr/local/cuda-12.8/lib64:${LD_LIBRARY_PATH}
export LD_LIBRARY_PATH=$ROCM_PATH/lib:$LD_LIBRARY_PATH
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
export WANDB_DISABLED=true
export CUDA_VISIBLE_DEVICES=0
deepspeed --num_gpus 1 --num_nodes 1 /src/train.py \
  --stage sft \
  --model_name_or_path "/home/z890/model/Llama-3-8B" \
  --do_train \
  --dataset "alpaca" \
  --max_length 4096 \
  --finetuning_type lora \
  --output_dir "/home/test/output/" \
  --overwrite_cache \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 1 \
  --lr_scheduler_type cosine \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 5000 \
  --learning_rate 4e-4 \
  --num_train_epochs 17 \
  --plot_loss \
  --fp16 \
  --lora_target q_proj,v_proj \
  --lora_r 1024 \
  --lora_alpha 2048 \
  --lora_dropout 0.05 \
  --preprocessing_num_workers 96 \
  --template default \
  --deepspeed ~/deepspeed/deepspeed_stage3.json
Others
No response
Have you tried the docker images?
Why not post the error messages you got? Without them, debugging is impossible.
Although I am not running conda, this may still help someone with similar problems.
LLaMA-Factory + RTX 5090 GPU + Windows 11 WSL2 running Ubuntu 22 LTS
Within the host Windows 11, I first installed the current Game Ready driver (572.42 from Feb 13th, 2025 for me).
Then I have installed Ubuntu 22 LTS as a WSL2 guest.
Now within Ubuntu 22 I installed Python 3.12.8 and then the CUDA Toolkit 12.8 according to https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin && \
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600 && \
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-wsl-ubuntu-12-8-local_12.8.0-1_amd64.deb && \
sudo dpkg -i cuda-repo-wsl-ubuntu-12-8-local_12.8.0-1_amd64.deb && \
sudo cp /var/cuda-repo-wsl-ubuntu-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/ && \
sudo apt-get update && \
sudo apt-get -y install cuda-toolkit-12-8 && \
CUDA_HOME=/usr/local/cuda && \
PATH=${CUDA_HOME}/bin${PATH:+:${PATH}} && \
LD_LIBRARY_PATH=${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}} && \
export LD_LIBRARY_PATH && \
export CUDA_HOME && \
export PATH
# check nvidia-smi:
$ nvidia-smi
Thu Feb 20 12:11:09 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.86.17 Driver Version: 572.42 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 On | 00000000:01:00.0 On | N/A |
| 0% 46C P8 36W / 575W | 2494MiB / 32607MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 66 G /Xwayland N/A |
+-----------------------------------------------------------------------------------------+
After doing the usual LLaMA-Factory installation routine...
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && cd LLaMA-Factory && \
python3 -m venv llamafactoryenv && source llamafactoryenv/bin/activate && \
pip install -e ".[torch,metrics]"
...I have tried...
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
...but got the following warning message:
llamafactoryenv/lib/python3.12/site-packages/torch/cuda/__init__.py:235: UserWarning: NVIDIA GeForce RTX 5090 with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90. If you want to use the NVIDIA GeForce RTX 5090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
and thereafter I got the following error messages:
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
These warning and error messages result from an old torch build that is incompatible with the RTX 50xx Blackwell GPU series:
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ python
>>> import torch
>>> print("PyTorch version:", torch.__version__)
PyTorch version: 2.6.0+cu124
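The `+cuXXX` suffix in `torch.__version__` tells you which CUDA toolkit the wheel was built against; any `cu124` build predates sm_120 support. A small sketch for pulling that tag out programmatically (the `cuda_tag` helper name is my own, not a torch API):

```python
def cuda_tag(torch_version):
    # torch wheel versions carry the CUDA build as a local version tag
    # after '+', e.g. "2.6.0+cu124" or "2.7.0.dev20250218+cu128".
    _, sep, local = torch_version.partition("+")
    return local if sep else None

print(cuda_tag("2.6.0+cu124"))              # cu124 -> no sm_120 kernels
print(cuda_tag("2.7.0.dev20250218+cu128"))  # cu128 -> Blackwell-capable nightly
print(cuda_tag("2.6.0"))                    # None  -> CPU-only or source build
```

In a live environment you would pass `torch.__version__` directly.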
Solution: Install new nightly torch torchvision torchaudio as follows:
# ensure you are still within the same activated LLaMA-Factory env!
# and do this AFTER the usual `pip install -e ".[torch,metrics]"`, because otherwise that step would overwrite the nightly build again
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
# verify torch version:
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ python
>>> import torch
>>> print("PyTorch version:", torch.__version__)
PyTorch version: 2.7.0.dev20250218+cu128
# more tests, see https://github.com/jayrodge/NVIDIA-RTX5090-AI-Dev-Setup/blob/main/pytorch_huggingface_rtx5090_setup.ipynb
# thanks to https://www.youtube.com/watch?v=af7XjGekm4g
Now just try again:
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml
Output:
[INFO|tokenization_utils_base.py:2500] 2025-02-20 12:45:43,523 >> tokenizer config file saved in saves/llama3-8b/lora/sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2509] 2025-02-20 12:45:43,523 >> Special tokens file saved in saves/llama3-8b/lora/sft/special_tokens_map.json
***** train metrics *****
  epoch                    = 2.9826
  total_flos               = 22730138GF
  train_loss               = 0.9175
  train_runtime            = 0:23:30.99
  train_samples_per_second = 2.32
  train_steps_per_second   = 0.289
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
[WARNING|2025-02-20 12:45:43] llamafactory.extras.ploting:162 >> No metric eval_loss to plot.
[WARNING|2025-02-20 12:45:43] llamafactory.extras.ploting:162 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2025-02-20 12:45:43,823 >> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
Furthermore, you can now run
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
(llamafactoryenv) myuser@xxx:~/LLaMA-Factory$ llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
It is probably similar with conda but I have no experience with that.
Nowadays training is much faster than in February 2025. Current performance for the example
llamafactory-cli train examples/train_lora/llama3_lora_sft.yaml:
[INFO|tokenization_utils_base.py:2510] 2025-03-31 23:26:19,095 >> tokenizer config file saved in saves/llama3-8b/lora/sft/tokenizer_config.json
[INFO|tokenization_utils_base.py:2519] 2025-03-31 23:26:19,096 >> Special tokens file saved in saves/llama3-8b/lora/sft/special_tokens_map.json
***** train metrics *****
  epoch                    = 2.9826
  total_flos               = 22730138GF
  train_loss               = 0.9239
  train_runtime            = 0:13:17.65
  train_samples_per_second = 4.103
  train_steps_per_second   = 0.511
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
[WARNING|2025-03-31 23:26:19] llamafactory.extras.ploting:148 >> No metric eval_loss to plot.
[WARNING|2025-03-31 23:26:19] llamafactory.extras.ploting:148 >> No metric eval_accuracy to plot.
[INFO|modelcard.py:449] 2025-03-31 23:26:19,254 >> Dropping the following result as it does not have all the necessary fields: {'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}}
versus old ones from https://github.com/hiyouga/LLaMA-Factory/issues/6958#issuecomment-2671295077
***** train metrics *****
  epoch                    = 2.9826
  total_flos               = 22730138GF
  train_loss               = 0.9175
  train_runtime            = 0:23:30.99
  train_samples_per_second = 2.32
  train_steps_per_second   = 0.289
Figure saved at: saves/llama3-8b/lora/sft/training_loss.png
[WARNING|2025-02-20 12:45:43] llamafactory.extras.ploting:162 >> No metric eval_loss to plot.
Maybe the performance jump comes from the CPU upgrade (Intel 9900K -> AMD 9800X3D), or from the more recent software:
$ python
Python 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bitsandbytes
>>> import torch
>>> print(bitsandbytes.__version__)
0.45.5.dev0
>>> print(torch.__version__)
2.8.0.dev20250331+cu128

$ nvidia-smi
Mon Mar 31 23:29:55 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5090        On  |   00000000:01:00.0  On |                  N/A |
|  0%   43C    P8             24W /  575W |    2129MiB /  32607MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Installation after fresh WSL2 with Ubuntu 22 LTS, with current GPU driver 572.83 in Windows:
cd ~ && \
sudo apt-get install -y python3.10-venv python3.10-dev python3.10-full && \
sudo apt-get install -y cmake && \
# CUDA Toolkit installation according to https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin && \
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600 && \
wget https://developer.download.nvidia.com/compute/cuda/12.8.1/local_installers/cuda-repo-wsl-ubuntu-12-8-local_12.8.1-1_amd64.deb && \
sudo dpkg -i cuda-repo-wsl-ubuntu-12-8-local_12.8.1-1_amd64.deb && \
sudo cp /var/cuda-repo-wsl-ubuntu-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/ && \
sudo apt-get update && \
sudo apt-get -y install cuda-toolkit-12-8 && \
CUDA_HOME=/usr/local/cuda && \
PATH=${CUDA_HOME}/bin${PATH:+:${PATH}} && \
LD_LIBRARY_PATH=${CUDA_HOME}/lib64 ${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}} && \
export LD_LIBRARY_PATH && \
export CUDA_HOME && \
export PATH && \
cd ~ && \
# LLaMa-Factory comes here:
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git && \
cd LLaMA-Factory && \
# pin a known-good commit; newer ones might cause dependency errors
git checkout 2d421c57bf85c043f290b757c1f4d0ffe633f23f && \
python3 -m venv llamafactoryenv && \
source llamafactoryenv/bin/activate && \
# from https://github.com/hiyouga/LLaMA-Factory/?tab=readme-ov-file#installation
pip install -e ".[torch,metrics]" && \
pip install deepspeed bitsandbytes vllm optimum && \
pip install --upgrade --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128 && \
# now we need to build bitsandbytes and triton from source:
cd ~ && git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git && cd bitsandbytes/ && \
cmake -DCOMPUTE_BACKEND=cuda -S . && \
make && \
pip install . && \
cd ~ && git clone https://github.com/triton-lang/triton.git && \
cd triton && \
pip install -r python/requirements.txt && \
pip install -e python && \
# edit: to avoid the error "No module named 'triton.runtime.jit'", you should do this too:
cd python/ && python setup.py install && \
cd ~ && cd LLaMA-Factory && \
# do the examples:
pip install "huggingface_hub[hf_transfer]" && pip install hf_transfer && huggingface-cli login && \
llamafactory-cli train examples/train_qlora/llama3_lora_sft_otfq.yaml
llamafactory-cli chat examples/inference/llama3_lora_sft.yaml
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
# edit: for some workflows you need these installed too:
# quote the version pins so the shell does not treat ">" as a redirection
pip install "optimum>=1.17.0"
pip install "auto_gptq>=0.5.0"
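The version pins above rely on pip's PEP 440 comparison rules. If you want to sanity-check an installed version against such a minimum yourself, here is a simplified pure-Python sketch (numeric segments only, so pre-release suffixes like `.dev20250331` are ignored; the `meets_minimum` helper name is my own):

```python
def meets_minimum(installed, required):
    # Compare dotted versions segment by segment, numerically.
    # Simplification: non-numeric parts (e.g. "dev20250331") are dropped;
    # real pip applies full PEP 440 semantics.
    def parts(v):
        return [int(p) for p in v.split(".") if p.isdigit()]
    return parts(installed) >= parts(required)

print(meets_minimum("1.17.0", "1.17.0"))  # True  (optimum>=1.17.0 satisfied)
print(meets_minimum("0.5.1", "0.5.0"))    # True  (auto_gptq>=0.5.0 satisfied)
print(meets_minimum("1.16.2", "1.17.0"))  # False (too old)
```

For anything serious, use `packaging.version.Version` instead of hand-rolled parsing.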
Further hijacking this issue for general RTX 5060/5070/5080/5090 compatibility info with LLaMA-Factory:
For vLLM inference (https://github.com/hiyouga/LLaMA-Factory?tab=readme-ov-file#deploy-with-openai-style-api-and-vllm -> API_PORT=8000 llamafactory-cli api examples/inference/llama3_vllm.yaml), Blackwell support is still lacking. The best you can do is check https://github.com/vllm-project/vllm/issues/14452 and build vLLM yourself for Blackwell. It works; I've tried it. Otherwise we still need to wait for the PyTorch 2.7.0 release (a release candidate appears to be coming soon).
anonymous@Anonymous:~$ nvidia-smi
Mon Apr 21 09:52:28 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 576.02       CUDA Version: 12.9     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 5090        On  | 00000000:01:00.0  On |                  N/A |
| 30%   49C    P0           310W /  600W  |  7705MiB / 32607MiB  |     59%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
anonymous@Anonymous:~$
It works :D with TensorFlow 2.16.1 AND TensorFlow nightly 2.20 / CUDA Toolkit 12.8 and cuDNN 8.9.7 under WSL2 (Windows 11 Pro, Ubuntu 22.04) on the RTX 5090 =D It's a dream, but it works :)
You probably need to bump CUDA from 12.4 to 12.8 to support the Blackwell-series GPUs.