"ImportError" when running fine-tuning API

Open helenxie-bit opened this issue 7 months ago • 0 comments

What happened?

When I ran the fine-tuning API example, the pod failed with the following error in the "storage_initializer" container:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/storage_initializer/storage.py", line 2, in <module>
    from .hugging_face import HuggingFace, HuggingFaceDataset
  File "/app/storage_initializer/hugging_face.py", line 8, in <module>
    from peft import LoraConfig
  File "/usr/local/lib/python3.11/site-packages/peft/__init__.py", line 22, in <module>
    from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
  File "/usr/local/lib/python3.11/site-packages/peft/mapping.py", line 16, in <module>
    from .peft_model import (
  File "/usr/local/lib/python3.11/site-packages/peft/peft_model.py", line 22, in <module>
    from accelerate import dispatch_model, infer_auto_device_map
  File "/usr/local/lib/python3.11/site-packages/accelerate/__init__.py", line 16, in <module>
    from .accelerator import Accelerator
  File "/usr/local/lib/python3.11/site-packages/accelerate/accelerator.py", line 34, in <module>
    from huggingface_hub import split_torch_state_dict_into_shards
ImportError: cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub' (/usr/local/lib/python3.11/site-packages/huggingface_hub/__init__.py)
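
The traceback suggests a version mismatch: the installed accelerate imports a name that the installed huggingface_hub does not export. A minimal sketch for confirming this from inside the container image is below. It uses only the standard library; the helper names `package_version` and `has_attribute` are hypothetical and not part of the Training Operator or any Hugging Face package.

```python
import importlib
from importlib import metadata


def package_version(dist_name: str) -> str:
    """Return the installed version of a distribution, or 'not installed'."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "not installed"


def has_attribute(module_name: str, attr: str) -> bool:
    """Report whether a module imports cleanly and exposes the given name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Packages that appear in the traceback above.
    for dist in ("huggingface-hub", "accelerate", "peft"):
        print(f"{dist}: {package_version(dist)}")

    # The name that accelerate failed to import.
    name = "split_torch_state_dict_into_shards"
    print(f"huggingface_hub exports {name}: {has_attribute('huggingface_hub', name)}")
```

If the last line reports `False`, the huggingface_hub in the image is presumably older than what the installed accelerate expects, and the two versions would need to be aligned.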

What did you expect to happen?

The fine-tuning API example should finish successfully.

Environment

Kubernetes version:

$ kubectl version

Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

kubeflow/training-operator:latest

Training Operator Python SDK version:

$ pip show kubeflow-training

Name: kubeflow-training
Version: 1.7.0
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /Users/helen/Documents/05_GSoC/training-operator/sdk/python
Editable project location: /Users/helen/Documents/05_GSoC/training-operator/sdk/python
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by: 

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

helenxie-bit, Jul 21 '24 05:07