"ImportError" when running fine-tuning API
What happened?
When I ran the fine-tuning API example, the pod failed with the following error in the "storage_initializer" container:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/storage_initializer/storage.py", line 2, in <module>
    from .hugging_face import HuggingFace, HuggingFaceDataset
  File "/app/storage_initializer/hugging_face.py", line 8, in <module>
    from peft import LoraConfig
  File "/usr/local/lib/python3.11/site-packages/peft/__init__.py", line 22, in <module>
    from .mapping import MODEL_TYPE_TO_PEFT_MODEL_MAPPING, PEFT_TYPE_TO_CONFIG_MAPPING, get_peft_config, get_peft_model
  File "/usr/local/lib/python3.11/site-packages/peft/mapping.py", line 16, in <module>
    from .peft_model import (
  File "/usr/local/lib/python3.11/site-packages/peft/peft_model.py", line 22, in <module>
    from accelerate import dispatch_model, infer_auto_device_map
  File "/usr/local/lib/python3.11/site-packages/accelerate/__init__.py", line 16, in <module>
    from .accelerator import Accelerator
  File "/usr/local/lib/python3.11/site-packages/accelerate/accelerator.py", line 34, in <module>
    from huggingface_hub import split_torch_state_dict_into_shards
ImportError: cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub' (/usr/local/lib/python3.11/site-packages/huggingface_hub/__init__.py)
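The traceback suggests that the accelerate version in the image expects a huggingface_hub function that the installed huggingface_hub does not provide, which usually points to a version mismatch between the two packages. A minimal sketch for checking this inside the "storage_initializer" image is shown here. It uses only the Python standard library; the helper names installed_version and check_symbol are hypothetical and not part of the training operator.

```python
import importlib
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_symbol(module_name, symbol):
    """Report whether `symbol` is available as an attribute of `module_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module is absent, or one of its own imports failed.
        return "module not importable"
    try:
        getattr(module, symbol)
    except AttributeError:
        return "missing symbol"
    return "ok"


if __name__ == "__main__":
    # Packages that appear in the traceback above.
    for dist in ("huggingface_hub", "accelerate", "peft"):
        print(dist, installed_version(dist))
    print(check_symbol("huggingface_hub", "split_torch_state_dict_into_shards"))
```

If the last line prints "missing symbol", the installed huggingface_hub is probably older than what the installed accelerate requires, and pinning compatible versions of the two in the image's requirements may resolve the error.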
What did you expect to happen?
The fine-tuning API example finishes successfully.
Environment
Kubernetes version:
$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.0
Training Operator version:
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:latest
Training Operator Python SDK version:
$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.7.0
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /Users/helen/Documents/05_GSoC/training-operator/sdk/python
Editable project location: /Users/helen/Documents/05_GSoC/training-operator/sdk/python
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by:
Impacted by this bug?
Give it a 👍 We prioritize the issues with most 👍