[XPU] library mismatch and version issue while performing fine-tuning on B580
Describe the bug
While performing fine-tuning of an LLM model on Battlemage (B580), I am facing a library conflict, specifically between transformers and the supported bitsandbytes.
Traceback (most recent call last):
File "./qlora_finetuning.py", line 22, in <module>
from peft import LoraConfig
File "/envs/ft-test/lib/python3.11/site-packages/peft/__init__.py", line 22, in <module>
from .auto import (
File "/envs/ft-test/lib/python3.11/site-packages/peft/auto.py", line 31, in <module>
from .config import PeftConfig
File "/envs/ft-test/lib/python3.11/site-packages/peft/config.py", line 23, in <module>
from .utils import CONFIG_NAME, PeftType, TaskType
File "/envs/ft-test/lib/python3.11/site-packages/peft/utils/__init__.py", line 21, in <module>
from .loftq_utils import replace_lora_weights_loftq
File "/envs/ft-test/lib/python3.11/site-packages/peft/utils/loftq_utils.py", line 35, in <module>
import bitsandbytes as bnb
File "/envs/ft-test/lib/python3.11/site-packages/bitsandbytes/__init__.py", line 15, in <module>
from .nn import modules
File "/envs/ft-test/lib/python3.11/site-packages/bitsandbytes/nn/__init__.py", line 21, in <module>
from .triton_based_modules import (
File "/envs/ft-test/lib/python3.11/site-packages/bitsandbytes/nn/triton_based_modules.py", line 6, in <module>
from bitsandbytes.triton.dequantize_rowwise import dequantize_rowwise
File "/envs/ft-test/lib/python3.11/site-packages/bitsandbytes/triton/dequantize_rowwise.py", line 12, in <module>
import triton
File "/envs/ft-test/lib/python3.11/site-packages/triton/__init__.py", line 8, in <module>
from .runtime import (
File "/envs/ft-test/lib/python3.11/site-packages/triton/runtime/__init__.py", line 1, in <module>
from .autotuner import (Autotuner, Config, Heuristics, autotune, heuristics)
File "/envs/ft-test/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 9, in <module>
from .jit import KernelInterface
File "/envs/ft-test/lib/python3.11/site-packages/triton/runtime/jit.py", line 12, in <module>
from ..runtime.driver import driver
File "/envs/ft-test/lib/python3.11/site-packages/triton/runtime/driver.py", line 1, in <module>
from ..backends import backends
File "/envs/ft-test/lib/python3.11/site-packages/triton/backends/__init__.py", line 50, in <module>
backends = _discover_backends()
^^^^^^^^^^^^^^^^^^^^
File "/envs/ft-test/lib/python3.11/site-packages/triton/backends/__init__.py", line 43, in _discover_backends
compiler = _load_module(name, os.path.join(root, name, 'compiler.py'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/envs/ft-test/lib/python3.11/site-packages/triton/backends/__init__.py", line 12, in _load_module
spec.loader.exec_module(module)
File "/envs/ft-test/lib/python3.11/site-packages/triton/backends/intel/compiler.py", line 2, in <module>
from triton._C.libtriton import ir, passes, llvm, intel
ImportError: cannot import name 'intel' from 'triton._C.libtriton' (/envs/ft-test/lib/python3.11/site-packages/triton/_C/libtriton.so)
How to reproduce
Steps to reproduce the error:
- followed the BMG guide --> https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/bmg_quickstart.md
- the PyTorch test is done, and it works fine -->
>>> import torch
>>> from ipex_llm.transformers import AutoModelForCausalLM
>>>
>>> tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
>>> tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
>>> print(torch.matmul(tensor_1, tensor_2).size())
torch.Size([1, 1, 40, 40])
>>>
- followed the fine-tuning doc --> https://github.com/intel/ipex-llm/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/trl-example (tried with both transformers 4.37 and 4.45)
- got the issue there -->
ImportError: cannot import name 'intel' from 'triton._C.libtriton' (/envs/ft-test/lib/python3.11/site-packages/triton/_C/libtriton.so)
- Also, I believe we don't need oneAPI separately; I have tested with and without oneAPI installed. With oneAPI installed and xpu_2.3, I get the error below instead (which is supposed to be a PyTorch issue, as the wheel is compiled for a particular XPU + PyTorch version):
File "/envs/ft-test/lib/python3.11/site-packages/torch/__init__.py", line 405, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: /envs/ft-test/lib/python3.11/site-packages/torch/lib/../../../../libsycl.so.8: undefined symbol: urBindlessImagesImportExternalMemoryExp, version LIBUR_LOADER_0.10
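As a quick sanity check that the installed torch XPU build itself loads (independent of the fine-tuning script), a minimal snippet like the following can be run first; torch.xpu.is_available() and torch.xpu.device_count() are assumed to be present in the 2.6.0+xpu build:
# minimal sanity check for the XPU build of PyTorch (assumes torch 2.6.0+xpu from the BMG guide)
import torch
print(torch.__version__)            # expect something like "2.6.0+xpu"
print(torch.xpu.is_available())     # True if the B580 is visible to PyTorch
print(torch.xpu.device_count())     # number of XPU devices detected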
Environment information
### w/o oneAPI (xpu_2.6)
PYTHON_VERSION=3.11.12
-----------------------------------------------------------------
transformers=4.45.0
-----------------------------------------------------------------
torch=2.6.0+xpu
-----------------------------------------------------------------
ipex-llm Version: 2.3.0b20250423
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i9-13900K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 24%
CPU max MHz: 5800.0000
CPU min MHz: 800.0000
-----------------------------------------------------------------
Total CPU Memory: 61.5439 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 24.10 \n \l
-----------------------------------------------------------------
Linux IMU-LAB1-BMG3-SUT 6.14.0-rc1-custom-rt #9 SMP PREEMPT_RT Mon Mar 31 15:51:25 CEST 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.39.20241101
Build ID: 00000000
Service:
Version: 1.2.39.20241101
Build ID: 00000000
Level Zero Version: 1.20.2
-----------------------------------------------------------------
Driver UUID 32352e30-392e-3332-3936-310000000000
Driver Version 25.09.32961
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
-----------------------------------------------------------------
Driver related package version:
ii intel-level-zero-gpu-raytracing 1.0.0-0ubuntu1~24.10~ppa4 amd64 Level Zero Ray Tracing Support library
-----------------------------------------------------------------
env-check.sh: line 167: sycl-ls: command not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
No device discovered
GPU0 Memory size=16G
-----------------------------------------------------------------
03:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Intel Graphics] (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1100
Flags: bus master, fast devsel, latency 0, IRQ 190, IOMMU group 20
Memory at 84000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 85000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: xe
Kernel modules: xe
------------------------------
### with oneAPI (xpu_2.3)
-----------------------------------------------------------------
PYTHON_VERSION=3.11.12
-----------------------------------------------------------------
Transformers is not installed.
-----------------------------------------------------------------
PyTorch is not installed.
-----------------------------------------------------------------
ipex-llm Version: 2.3.0b20250423
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i9-13900K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 23%
CPU max MHz: 5800.0000
CPU min MHz: 800.0000
-----------------------------------------------------------------
Total CPU Memory: 61.5439 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 24.10 \n \l
-----------------------------------------------------------------
Linux IMU-LAB1-BMG3-SUT 6.14.0-rc1-custom-rt #9 SMP PREEMPT_RT Mon Mar 31 15:51:25 CEST 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.39.20241101
Build ID: 00000000
Service:
Version: 1.2.39.20241101
Build ID: 00000000
Level Zero Version: 1.20.2
-----------------------------------------------------------------
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver UUID 32352e30-392e-3332-3936-310000000000
Driver Version 25.09.32961
-----------------------------------------------------------------
Driver related package version:
ii intel-level-zero-gpu-raytracing 1.0.0-0ubuntu1~24.10~ppa4 amd64 Level Zero Ray Tracing Support library
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
No device discovered
GPU0 Memory size=16G
-----------------------------------------------------------------
03:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Intel Graphics] (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1100
Flags: bus master, fast devsel, latency 0, IRQ 190, IOMMU group 20
Memory at 84000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 85000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: xe
Kernel modules: xe
Hi @raj-ritu17 ,
After validating torch tensor multiplication on BMG as here, I ran the following to install the trl dependencies (since ipex-llm had already been installed in the BMG setup before):
pip install transformers==4.45.0 "trl<0.12.0" datasets
pip install peft==0.10.0
pip install bitsandbytes==0.45.1 scipy
Then, without sourcing oneAPI (because oneAPI is now prebuilt into ipex-llm), from peft import LoraConfig succeeds.
Key dependency versions are as below:
accelerate 0.23.0
bigdl-core-xe-all 2.7.0b20250426
bitsandbytes 0.45.1
ipex-llm 2.3.0b20250426
peft 0.10.0
pytorch-triton-xpu 3.2.0
torch 2.6.0+xpu
torchaudio 2.6.0+xpu
torchvision 0.21.0+xpu
transformers 4.45.0
trl 0.11.4
Please pay attention to triton, as your error is thrown from it, and there are two implementations of triton's XPU backend (intel-xpu-backend-for-triton and pytorch-triton-xpu).
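For example, a rough way to see which triton distributions are actually present in the environment (the package names below are my assumption; pip list | grep -i triton gives the same information) is:
# rough check of which triton distributions are installed in the current environment
import importlib.metadata as md
for pkg in ("triton", "pytorch-triton-xpu"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")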
My Ubuntu version is 24.10 and my kernel version is 6.15.0-rc2+prerelease10+.
@Uxito-Ada Thanks for the update :)
In my opinion the issue is not in 'from peft import LoraConfig'; it is actually coming from importing "DistributedType" from the wrong file. I have detailed this in the last lines (for fine-tuning on BMG with xpu_2.6).
- for testing purposes, I installed the xpu libraries in a fresh env, following here, and validated torch tensor multiplication with no issue; terminal output:
Python 3.11.12 | packaged by conda-forge | (main, Apr 10 2025, 22:23:25) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from ipex_llm.transformers import AutoModelForCausalLM
/home/rajritu/miniforge3/envs/ft-test/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
>>>
>>> tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
>>> tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
>>> print(torch.matmul(tensor_1, tensor_2).size())
torch.Size([1, 1, 40, 40])
>>>
- the issue is encountered when we start fine-tuning, following from here; the exact failing line is this particular import, 'from ipex_llm.transformers.qlora':
Python 3.11.12 | packaged by conda-forge | (main, Apr 10 2025, 22:23:25) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import os
>>> import transformers
>>> from transformers import AutoTokenizer
>>> from peft import LoraConfig
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
>>> from transformers import BitsAndBytesConfig
>>> from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/envs/ft-test/lib/python3.11/site-packages/ipex_llm/transformers/qlora.py", line 61, in <module>
from ipex_llm.transformers import training_patch
File "/envs/ft-test/lib/python3.11/site-packages/ipex_llm/transformers/training_patch.py", line 83, in <module>
from transformers.training_args import logger, ParallelMode, DistributedType
ImportError: cannot import name 'DistributedType' from 'transformers.training_args' (/envs/ft-test/lib/python3.11/site-packages/transformers/training_args.py)
>>> from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/envs/ft-test/lib/python3.11/site-packages/ipex_llm/transformers/qlora.py", line 61, in <module>
from ipex_llm.transformers import training_patch
File "/envs/ft-test/lib/python3.11/site-packages/ipex_llm/transformers/training_patch.py", line 83, in <module>
from transformers.training_args import logger, ParallelMode, DistributedType
ImportError: cannot import name 'DistributedType' from 'transformers.training_args' (/envs/ft-test/lib/python3.11/site-packages/transformers/training_args.py)
>>> from ipex_llm.transformers import AutoModelForCausalLM
>>> from datasets import load_dataset
2025-04-28 15:58:33,213 - INFO - PyTorch version 2.6.0+xpu available.
>>> from trl import SFTTrainer
>>> import argparse
>>>
Why this issue appeared:
- we are calling: https://github.com/intel/ipex-llm/blob/main/python/llm/src/ipex_llm/transformers/qlora.py --> line 61: from ipex_llm.transformers import training_patch
- which in turn calls: https://github.com/intel/ipex-llm/blob/main/python/llm/src/ipex_llm/transformers/training_patch.py --> line 83: from transformers.training_args import logger, ParallelMode, DistributedType
- 'DistributedType' is no longer defined in transformers.training_args.py, or may have been moved in a newer transformers version (see the quick check below)
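A quick, rough check of where DistributedType is actually importable from in a given environment (nothing ipex-llm specific, just standard importlib) is:
# check which module actually exposes DistributedType in this environment
import importlib
for mod in ("transformers.training_args", "accelerate"):
    try:
        m = importlib.import_module(mod)
        print(mod, "->", "has DistributedType" if hasattr(m, "DistributedType") else "missing DistributedType")
    except ImportError as e:
        print(mod, "-> import failed:", e)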
How to resolve:
- the actual 'DistributedType' implementation is in this file --> accelerate/utils/dataclasses.py
import enum  # needed if running this excerpt standalone

class DistributedType(str, enum.Enum):
"""
Represents a type of distributed environment.
Values:
- **NO** -- Not a distributed environment, just a single process.
- **MULTI_CPU** -- Distributed on multiple CPU nodes.
- **MULTI_GPU** -- Distributed on multiple GPUs.
- **MULTI_NPU** -- Distributed on multiple NPUs.
- **MULTI_XPU** -- Distributed on multiple XPUs.
- **DEEPSPEED** -- Using DeepSpeed.
- **TPU** -- Distributed on TPUs.
"""
# Subclassing str as well as Enum allows the `DistributedType` to be JSON-serializable out of the box.
NO = "NO"
MULTI_CPU = "MULTI_CPU"
MULTI_GPU = "MULTI_GPU"
MULTI_NPU = "MULTI_NPU"
MULTI_XPU = "MULTI_XPU"
DEEPSPEED = "DEEPSPEED"
FSDP = "FSDP"
TPU = "TPU"
MEGATRON_LM = "MEGATRON_LM"
- so we can import it from accelerate; to test, we can run:
>>> from accelerate import DistributedType
>>>
Workaround:
- we must change these imports in the file --> src/ipex_llm/transformers/training_patch.py (a more defensive variant is sketched further below)
from transformers.training_args import logger, ParallelMode
from accelerate import DistributedType
- also needed: pip install 'accelerate>=0.26.0'
- pip install --pre --upgrade accelerate
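A slightly more defensive variant of this patch (just a sketch under my assumptions, not an official ipex-llm fix) would try the old location first and fall back to accelerate, so it keeps working with older transformers versions as well:
# sketch of a version-tolerant import for training_patch.py (assumption, not the official fix)
from transformers.training_args import logger, ParallelMode
try:
    # older transformers versions still export DistributedType here
    from transformers.training_args import DistributedType
except ImportError:
    # newer transformers versions rely on accelerate for this enum
    from accelerate import DistributedType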
- Tests are here (after the changes):
>>> import torch
>>> import os
>>>
>>> import transformers
>>> from transformers import AutoTokenizer
>>> from peft import LoraConfig
>>> from transformers import BitsAndBytesConfig
>>> from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
>>> from ipex_llm.transformers import AutoModelForCausalLM
>>> from datasets import load_dataset
>>> from trl import SFTTrainer
>>> import argparse
>>>
Hi @raj-ritu17 ,
Thanks for your analysis. I have reproduced the DistributedType error, which is a different issue from the BMG machine issue, and we are going to fix it.