[XPU] library mismatch and version issue while performing fine-tuning on B580
Describe the bug
While performing fine-tuning of an LLM model on Battlemage (B580), I am facing a library conflict, specifically between transformers and the supported bitsandbytes.
Traceback (most recent call last):
File "./qlora_finetuning.py", line 22, in <module>
from peft import LoraConfig
File "/envs/ft-test/lib/python3.11/site-packages/peft/__init__.py", line 22, in <module>
from .auto import (
File "/envs/ft-test/lib/python3.11/site-packages/peft/auto.py", line 31, in <module>
from .config import PeftConfig
File "/envs/ft-test/lib/python3.11/site-packages/peft/config.py", line 23, in <module>
from .utils import CONFIG_NAME, PeftType, TaskType
File "/envs/ft-test/lib/python3.11/site-packages/peft/utils/__init__.py", line 21, in <module>
from .loftq_utils import replace_lora_weights_loftq
File "/envs/ft-test/lib/python3.11/site-packages/peft/utils/loftq_utils.py", line 35, in <module>
import bitsandbytes as bnb
File "/envs/ft-test/lib/python3.11/site-packages/bitsandbytes/__init__.py", line 15, in <module>
from .nn import modules
File "/envs/ft-test/lib/python3.11/site-packages/bitsandbytes/nn/__init__.py", line 21, in <module>
from .triton_based_modules import (
File "/envs/ft-test/lib/python3.11/site-packages/bitsandbytes/nn/triton_based_modules.py", line 6, in <module>
from bitsandbytes.triton.dequantize_rowwise import dequantize_rowwise
File "/envs/ft-test/lib/python3.11/site-packages/bitsandbytes/triton/dequantize_rowwise.py", line 12, in <module>
import triton
File "/envs/ft-test/lib/python3.11/site-packages/triton/__init__.py", line 8, in <module>
from .runtime import (
File "/envs/ft-test/lib/python3.11/site-packages/triton/runtime/__init__.py", line 1, in <module>
from .autotuner import (Autotuner, Config, Heuristics, autotune, heuristics)
File "/envs/ft-test/lib/python3.11/site-packages/triton/runtime/autotuner.py", line 9, in <module>
from .jit import KernelInterface
File "/envs/ft-test/lib/python3.11/site-packages/triton/runtime/jit.py", line 12, in <module>
from ..runtime.driver import driver
File "/envs/ft-test/lib/python3.11/site-packages/triton/runtime/driver.py", line 1, in <module>
from ..backends import backends
File "/envs/ft-test/lib/python3.11/site-packages/triton/backends/__init__.py", line 50, in <module>
backends = _discover_backends()
^^^^^^^^^^^^^^^^^^^^
File "/envs/ft-test/lib/python3.11/site-packages/triton/backends/__init__.py", line 43, in _discover_backends
compiler = _load_module(name, os.path.join(root, name, 'compiler.py'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/envs/ft-test/lib/python3.11/site-packages/triton/backends/__init__.py", line 12, in _load_module
spec.loader.exec_module(module)
File "/envs/ft-test/lib/python3.11/site-packages/triton/backends/intel/compiler.py", line 2, in <module>
from triton._C.libtriton import ir, passes, llvm, intel
ImportError: cannot import name 'intel' from 'triton._C.libtriton' (/envs/ft-test/lib/python3.11/site-packages/triton/_C/libtriton.so)
How to reproduce
Steps to reproduce the error:
- followed the BMG guide --> https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/bmg_quickstart.md
- the PyTorch test is done, and it works fine -->
>>> import torch
>>> from ipex_llm.transformers import AutoModelForCausalLM
>>>
>>> tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
>>> tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
>>> print(torch.matmul(tensor_1, tensor_2).size())
torch.Size([1, 1, 40, 40])
>>>
- followed the fine-tuning doc --> https://github.com/intel/ipex-llm/tree/main/python/llm/example/GPU/LLM-Finetuning/QLoRA/trl-example (tried with both transformers 4.37 and 4.45)
- got the issue there -->
ImportError: cannot import name 'intel' from 'triton._C.libtriton' (/envs/ft-test/lib/python3.11/site-packages/triton/_C/libtriton.so)
- Also, I believe we don't need oneAPI separately; I have tested with and without oneAPI installed. With oneAPI installed and xpu_2.3, I get the error below instead (which is supposed to be a PyTorch issue, as the wheel is compiled for a particular XPU + PyTorch version):
File "/envs/ft-test/lib/python3.11/site-packages/torch/__init__.py", line 405, in <module>
from torch._C import * # noqa: F403
^^^^^^^^^^^^^^^^^^^^^^
ImportError: /envs/ft-test/lib/python3.11/site-packages/torch/lib/../../../../libsycl.so.8: undefined symbol: urBindlessImagesImportExternalMemoryExp, version LIBUR_LOADER_0.10
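As a quick sanity check that the installed torch XPU build itself loads (independent of the fine-tuning script), a minimal snippet like the following can be run first; torch.xpu.is_available() and torch.xpu.device_count() are assumed to be present in the 2.6.0+xpu build:
# minimal sanity check for the XPU build of PyTorch (assumes torch 2.6.0+xpu from the BMG guide)
import torch
print(torch.__version__)            # expect something like "2.6.0+xpu"
print(torch.xpu.is_available())     # True if the B580 is visible to PyTorch
print(torch.xpu.device_count())     # number of XPU devices detected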
Environment information
### w/o oneAPI (xpu_2.6)
PYTHON_VERSION=3.11.12
-----------------------------------------------------------------
transformers=4.45.0
-----------------------------------------------------------------
torch=2.6.0+xpu
-----------------------------------------------------------------
ipex-llm Version: 2.3.0b20250423
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i9-13900K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 24%
CPU max MHz: 5800.0000
CPU min MHz: 800.0000
-----------------------------------------------------------------
Total CPU Memory: 61.5439 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 24.10 \n \l
-----------------------------------------------------------------
Linux IMU-LAB1-BMG3-SUT 6.14.0-rc1-custom-rt #9 SMP PREEMPT_RT Mon Mar 31 15:51:25 CEST 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.39.20241101
Build ID: 00000000
Service:
Version: 1.2.39.20241101
Build ID: 00000000
Level Zero Version: 1.20.2
-----------------------------------------------------------------
Driver UUID 32352e30-392e-3332-3936-310000000000
Driver Version 25.09.32961
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
-----------------------------------------------------------------
Driver related package version:
ii intel-level-zero-gpu-raytracing 1.0.0-0ubuntu1~24.10~ppa4 amd64 Level Zero Ray Tracing Support library
-----------------------------------------------------------------
env-check.sh: line 167: sycl-ls: command not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
No device discovered
GPU0 Memory size=16G
-----------------------------------------------------------------
03:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Intel Graphics] (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1100
Flags: bus master, fast devsel, latency 0, IRQ 190, IOMMU group 20
Memory at 84000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 85000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: xe
Kernel modules: xe
------------------------------
### with oneAPI (xpu_2.3)
-----------------------------------------------------------------
PYTHON_VERSION=3.11.12
-----------------------------------------------------------------
Transformers is not installed.
-----------------------------------------------------------------
PyTorch is not installed.
-----------------------------------------------------------------
ipex-llm Version: 2.3.0b20250423
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: 13th Gen Intel(R) Core(TM) i9-13900K
CPU family: 6
Model: 183
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
CPU(s) scaling MHz: 23%
CPU max MHz: 5800.0000
CPU min MHz: 800.0000
-----------------------------------------------------------------
Total CPU Memory: 61.5439 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 24.10 \n \l
-----------------------------------------------------------------
Linux IMU-LAB1-BMG3-SUT 6.14.0-rc1-custom-rt #9 SMP PREEMPT_RT Mon Mar 31 15:51:25 CEST 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.39.20241101
Build ID: 00000000
Service:
Version: 1.2.39.20241101
Build ID: 00000000
Level Zero Version: 1.20.2
-----------------------------------------------------------------
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver Version 2023.16.12.0.12_195853.xmain-hotfix
Driver UUID 32352e30-392e-3332-3936-310000000000
Driver Version 25.09.32961
-----------------------------------------------------------------
Driver related package version:
ii intel-level-zero-gpu-raytracing 1.0.0-0ubuntu1~24.10~ppa4 amd64 Level Zero Ray Tracing Support library
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
No device discovered
GPU0 Memory size=16G
-----------------------------------------------------------------
03:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Intel Graphics] (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1100
Flags: bus master, fast devsel, latency 0, IRQ 190, IOMMU group 20
Memory at 84000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 85000000 [disabled] [size=2M]
Capabilities: <access denied>
Kernel driver in use: xe
Kernel modules: xe
Hi @raj-ritu17 ,
After validating torch tensor multiplication on BMG as here, I ran the following to install the trl dependencies (since ipex-llm had already been installed in the BMG setup before):
pip install transformers==4.45.0 "trl<0.12.0" datasets
pip install peft==0.10.0
pip install bitsandbytes==0.45.1 scipy
Then, without sourcing oneAPI (because oneAPI is now prebuilt into ipex-llm), from peft import LoraConfig succeeds.
Key dependency versions are as below:
accelerate 0.23.0
bigdl-core-xe-all 2.7.0b20250426
bitsandbytes 0.45.1
ipex-llm 2.3.0b20250426
peft 0.10.0
pytorch-triton-xpu 3.2.0
torch 2.6.0+xpu
torchaudio 2.6.0+xpu
torchvision 0.21.0+xpu
transformers 4.45.0
trl 0.11.4
Please pay attention to triton, as your error is thrown from it, and there are two implementations of triton's XPU backend (intel-xpu-backend-for-triton and pytorch-triton-xpu).
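For example, a rough way to see which triton distributions are actually present in the environment (the package names below are my assumption; pip list | grep -i triton gives the same information) is:
# rough check of which triton distributions are installed in the current environment
import importlib.metadata as md
for pkg in ("triton", "pytorch-triton-xpu"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")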
My Ubuntu version is 24.10 and my kernel version is 6.15.0-rc2+prerelease10+.
@Uxito-Ada Thanks for the update :)
In my opinion the issue is not in 'from peft import LoraConfig'; it is actually coming from importing "DistributedType" from the wrong file. I have detailed this in the last lines (for fine-tuning on BMG with xpu_2.6).
- for testing purposes, I installed the xpu libraries in a fresh env, following here, and validated torch tensor multiplication with no issue; terminal output:
Python 3.11.12 | packaged by conda-forge | (main, Apr 10 2025, 22:23:25) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> from ipex_llm.transformers import AutoModelForCausalLM
/home/rajritu/miniforge3/envs/ft-test/lib/python3.11/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
>>>
>>> tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')
>>> tensor_2 = torch.randn(1, 1, 128, 40).to('xpu')
>>> print(torch.matmul(tensor_1, tensor_2).size())
torch.Size([1, 1, 40, 40])
>>>
- the issue is encountered when we start fine-tuning, following from here; the exact failing line is this particular import, 'from ipex_llm.transformers.qlora':
Python 3.11.12 | packaged by conda-forge | (main, Apr 10 2025, 22:23:25) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> import os
>>> import transformers
>>> from transformers import AutoTokenizer
>>> from peft import LoraConfig
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
>>> from transformers import BitsAndBytesConfig
>>> from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/envs/ft-test/lib/python3.11/site-packages/ipex_llm/transformers/qlora.py", line 61, in <module>
from ipex_llm.transformers import training_patch
File "/envs/ft-test/lib/python3.11/site-packages/ipex_llm/transformers/training_patch.py", line 83, in <module>
from transformers.training_args import logger, ParallelMode, DistributedType
ImportError: cannot import name 'DistributedType' from 'transformers.training_args' (/envs/ft-test/lib/python3.11/site-packages/transformers/training_args.py)
>>> from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/envs/ft-test/lib/python3.11/site-packages/ipex_llm/transformers/qlora.py", line 61, in <module>
from ipex_llm.transformers import training_patch
File "/envs/ft-test/lib/python3.11/site-packages/ipex_llm/transformers/training_patch.py", line 83, in <module>
from transformers.training_args import logger, ParallelMode, DistributedType
ImportError: cannot import name 'DistributedType' from 'transformers.training_args' (/envs/ft-test/lib/python3.11/site-packages/transformers/training_args.py)
>>> from ipex_llm.transformers import AutoModelForCausalLM
>>> from datasets import load_dataset
2025-04-28 15:58:33,213 - INFO - PyTorch version 2.6.0+xpu available.
>>> from trl import SFTTrainer
>>> import argparse
>>>
Why this issue appeared:
- we are calling: https://github.com/intel/ipex-llm/blob/main/python/llm/src/ipex_llm/transformers/qlora.py --> line 61: from ipex_llm.transformers import training_patch
- which in turn calls: https://github.com/intel/ipex-llm/blob/main/python/llm/src/ipex_llm/transformers/training_patch.py --> line 83: from transformers.training_args import logger, ParallelMode, DistributedType
- 'DistributedType' is no longer defined in transformers.training_args.py, or may have been moved in a newer transformers version (see the quick check below)
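A quick, rough check of where DistributedType is actually importable from in a given environment (nothing ipex-llm specific, just standard importlib) is:
# check which module actually exposes DistributedType in this environment
import importlib
for mod in ("transformers.training_args", "accelerate"):
    try:
        m = importlib.import_module(mod)
        print(mod, "->", "has DistributedType" if hasattr(m, "DistributedType") else "missing DistributedType")
    except ImportError as e:
        print(mod, "-> import failed:", e)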
How to resolve:
- the actual 'DistributedType' implementation is in this file --> accelerate/utils/dataclasses.py
import enum  # needed if running this excerpt standalone

class DistributedType(str, enum.Enum):
"""
Represents a type of distributed environment.
Values:
- **NO** -- Not a distributed environment, just a single process.
- **MULTI_CPU** -- Distributed on multiple CPU nodes.
- **MULTI_GPU** -- Distributed on multiple GPUs.
- **MULTI_NPU** -- Distributed on multiple NPUs.
- **MULTI_XPU** -- Distributed on multiple XPUs.
- **DEEPSPEED** -- Using DeepSpeed.
- **TPU** -- Distributed on TPUs.
"""
# Subclassing str as well as Enum allows the `DistributedType` to be JSON-serializable out of the box.
NO = "NO"
MULTI_CPU = "MULTI_CPU"
MULTI_GPU = "MULTI_GPU"
MULTI_NPU = "MULTI_NPU"
MULTI_XPU = "MULTI_XPU"
DEEPSPEED = "DEEPSPEED"
FSDP = "FSDP"
TPU = "TPU"
MEGATRON_LM = "MEGATRON_LM"
- so we can import it from accelerate; to test, we can run:
>>> from accelerate import DistributedType
>>>
Workaround:
- we must change these imports in the file --> src/ipex_llm/transformers/training_patch.py (a more defensive variant is sketched further below)
from transformers.training_args import logger, ParallelMode
from accelerate import DistributedType
- also needed: pip install 'accelerate>=0.26.0'
- pip install --pre --upgrade accelerate
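A slightly more defensive variant of this patch (just a sketch under my assumptions, not an official ipex-llm fix) would try the old location first and fall back to accelerate, so it keeps working with older transformers versions as well:
# sketch of a version-tolerant import for training_patch.py (assumption, not the official fix)
from transformers.training_args import logger, ParallelMode
try:
    # older transformers versions still export DistributedType here
    from transformers.training_args import DistributedType
except ImportError:
    # newer transformers versions rely on accelerate for this enum
    from accelerate import DistributedType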
- Tests are here (after the changes):
>>> import torch
>>> import os
>>>
>>> import transformers
>>> from transformers import AutoTokenizer
>>> from peft import LoraConfig
>>> from transformers import BitsAndBytesConfig
>>> from ipex_llm.transformers.qlora import get_peft_model, prepare_model_for_kbit_training
>>> from ipex_llm.transformers import AutoModelForCausalLM
>>> from datasets import load_dataset
>>> from trl import SFTTrainer
>>> import argparse
>>>
Hi @raj-ritu17 ,
Thanks for your analysis. I have reproduced the DistributedType error, which is a different issue from the BMG machine issue, and we are going to fix it.