Cannot fine-tune LLM without GPU - CUDA error and DDP initialization
What happened?
I am trying to fine-tune an LLM using Kubeflow without GPU devices. However, I encountered two issues during the process:
- When I removed the `gpu` key from `resources_per_worker`, the training job still attempted to allocate GPUs, resulting in `CUDA error: invalid device ordinal`.
- To address this, I tried adding `ddp_backend="gloo"` to `training_parameters`. However, this led to another error.
I followed this guide: https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/. This is the code I ran:
```python
import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    storage_config={
        "size": "5Gi",
        "storage_class": "nfs-client",
    },
    # BERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use 100 samples from the Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we skip
    # evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
            ddp_backend="gloo",
        ),
        # Set LoRA config to reduce number of trainable model parameters.
        lora_config=LoraConfig(
            r=8,
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    num_workers=2,  # nnodes parameter for torchrun command.
    num_procs_per_worker=16,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        # "gpu": 0,
        "cpu": 16,
        "memory": "16G",
    },
)
```
What did you expect to happen?
The training job should correctly initialize without attempting to allocate GPUs.
Environment
Kubernetes version:
```
$ kubectl version
Client Version: v1.29.6+k3s2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.6+k3s2
```
Training Operator version:
```
$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:latest
```
Training Operator Python SDK version:
```
$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.8.1
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by:
```
I also get the error `[rank0]: ValueError: Please specify target_modules in peft_config`. I tried removing the LoRA config, but the error still exists.
```python
import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    storage_config={
        "size": "5Gi",
        "storage_class": "nfs-client",
    },
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://distilbert/distilbert-base-uncased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use 100 samples from the Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Skip evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
            # ddp_backend="gloo",
        ),
        # LoRA config removed, but the target_modules error persists.
        # lora_config=LoraConfig(
        #     r=8,
        #     lora_alpha=8,
        #     lora_dropout=0.1,
        #     bias="none",
        #     target_modules=["encoder.layer.*.attention.self.query", "encoder.layer.*.attention.self.key"],
        # ),
    ),
    num_workers=2,  # nnodes parameter for torchrun command.
    num_procs_per_worker=20,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        "cpu": 20,
        "memory": "20G",
    },
)
```
Thank you for creating this! For the first error, can you please check the PyTorchJob? It should be created without GPU resources.
kubectl get pytorchjob -n <NAMESPACE> -o yaml
Hi, this is the output:

```
(base) jovyan@ex-0:~$ kubectl get pytorchjobs -n kubeflow-user-example-com -o yaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
creationTimestamp: "2025-01-07T14:10:47Z"
generation: 1
name: fine-tune-bert
namespace: kubeflow-user-example-com
resourceVersion: "580563"
uid: 72ed106c-5299-4de3-9f27-8e5464d4e59b
spec:
nprocPerNode: "20"
pytorchReplicaSpecs:
Master:
replicas: 1
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- args:
- --model_uri
- hf://distilbert/distilbert-base-uncased
- --transformer_type
- AutoModelForSequenceClassification
- --num_labels
- None
- --model_dir
- /workspace/model
- --dataset_dir
- /workspace/dataset
- --lora_config
- '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type":
null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha":
null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
"modules_to_save": null, "init_lora_weights": true}'
- --training_parameters
- '{"output_dir": "test_trainer", "overwrite_output_dir": false, "do_train":
false, "do_eval": false, "do_predict": false, "evaluation_strategy":
"no", "prediction_loss_only": false, "per_device_train_batch_size":
8, "per_device_eval_batch_size": 8, "per_gpu_train_batch_size": null,
"per_gpu_eval_batch_size": null, "gradient_accumulation_steps": 1,
"eval_accumulation_steps": null, "eval_delay": 0, "learning_rate":
5e-05, "weight_decay": 0.0, "adam_beta1": 0.9, "adam_beta2": 0.999,
"adam_epsilon": 1e-08, "max_grad_norm": 1.0, "num_train_epochs": 3.0,
"max_steps": -1, "lr_scheduler_type": "linear", "lr_scheduler_kwargs":
{}, "warmup_ratio": 0.0, "warmup_steps": 0, "log_level": "info", "log_level_replica":
"warning", "log_on_each_node": true, "logging_dir": "test_trainer/runs/Jan07_14-10-47_ex-0",
"logging_strategy": "steps", "logging_first_step": false, "logging_steps":
500, "logging_nan_inf_filter": true, "save_strategy": "no", "save_steps":
500, "save_total_limit": null, "save_safetensors": true, "save_on_each_node":
false, "save_only_model": false, "no_cuda": false, "use_cpu": false,
"use_mps_device": false, "seed": 42, "data_seed": null, "jit_mode_eval":
false, "use_ipex": false, "bf16": false, "fp16": false, "fp16_opt_level":
"O1", "half_precision_backend": "auto", "bf16_full_eval": false, "fp16_full_eval":
false, "tf32": null, "local_rank": 0, "ddp_backend": null, "tpu_num_cores":
null, "tpu_metrics_debug": false, "debug": [], "dataloader_drop_last":
false, "eval_steps": null, "dataloader_num_workers": 0, "dataloader_prefetch_factor":
null, "past_index": -1, "run_name": "test_trainer", "disable_tqdm":
true, "remove_unused_columns": true, "label_names": null, "load_best_model_at_end":
false, "metric_for_best_model": null, "greater_is_better": null, "ignore_data_skip":
false, "fsdp": [], "fsdp_min_num_params": 0, "fsdp_config": {"min_num_params":
0, "xla": false, "xla_fsdp_v2": false, "xla_fsdp_grad_ckpt": false},
"fsdp_transformer_layer_cls_to_wrap": null, "accelerator_config":
{"split_batches": false, "dispatch_batches": null, "even_batches":
true, "use_seedable_sampler": true}, "deepspeed": null, "label_smoothing_factor":
0.0, "optim": "adamw_torch", "optim_args": null, "adafactor": false,
"group_by_length": false, "length_column_name": "length", "report_to":
[], "ddp_find_unused_parameters": null, "ddp_bucket_cap_mb": null,
"ddp_broadcast_buffers": null, "dataloader_pin_memory": true, "dataloader_persistent_workers":
false, "skip_memory_metrics": true, "use_legacy_prediction_loop":
false, "push_to_hub": false, "resume_from_checkpoint": null, "hub_model_id":
null, "hub_strategy": "every_save", "hub_token": "<HUB_TOKEN>", "hub_private_repo":
false, "hub_always_push": false, "gradient_checkpointing": false,
"gradient_checkpointing_kwargs": null, "include_inputs_for_metrics":
false, "fp16_backend": "auto", "push_to_hub_model_id": null, "push_to_hub_organization":
null, "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>", "mp_parameters":
"", "auto_find_batch_size": false, "full_determinism": false, "torchdynamo":
null, "ray_scope": "last", "ddp_timeout": 1800, "torch_compile": false,
"torch_compile_backend": null, "torch_compile_mode": null, "dispatch_batches":
null, "split_batches": null, "include_tokens_per_second": false, "include_num_input_tokens_seen":
false, "neftune_noise_alpha": null}'
image: docker.io/kubeflow/trainer-huggingface
name: pytorch
resources:
limits:
cpu: 20
memory: 20G
requests:
cpu: 20
memory: 20G
volumeMounts:
- mountPath: /workspace
name: storage-initializer
initContainers:
- args:
- --model_provider
- hf
- --model_provider_parameters
- '{"model_uri": "hf://distilbert/distilbert-base-uncased", "transformer_type":
"AutoModelForSequenceClassification", "access_token": null, "num_labels":
null}'
- --dataset_provider
- hf
- --dataset_provider_parameters
- '{"repo_id": "yelp_review_full", "access_token": null, "split": "train[:100]"}'
image: docker.io/kubeflow/storage-initializer
name: storage-initializer
volumeMounts:
- mountPath: /workspace
name: storage-initializer
volumes:
- name: storage-initializer
persistentVolumeClaim:
claimName: storage-initializer
Worker:
replicas: 1
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
spec:
containers:
- args:
- --model_uri
- hf://distilbert/distilbert-base-uncased
- --transformer_type
- AutoModelForSequenceClassification
- --num_labels
- None
- --model_dir
- /workspace/model
- --dataset_dir
- /workspace/dataset
- --lora_config
- '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type":
null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha":
null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
"modules_to_save": null, "init_lora_weights": true}'
- --training_parameters
- '{"output_dir": "test_trainer", "overwrite_output_dir": false, "do_train":
false, "do_eval": false, "do_predict": false, "evaluation_strategy":
"no", "prediction_loss_only": false, "per_device_train_batch_size":
8, "per_device_eval_batch_size": 8, "per_gpu_train_batch_size": null,
"per_gpu_eval_batch_size": null, "gradient_accumulation_steps": 1,
"eval_accumulation_steps": null, "eval_delay": 0, "learning_rate":
5e-05, "weight_decay": 0.0, "adam_beta1": 0.9, "adam_beta2": 0.999,
"adam_epsilon": 1e-08, "max_grad_norm": 1.0, "num_train_epochs": 3.0,
"max_steps": -1, "lr_scheduler_type": "linear", "lr_scheduler_kwargs":
{}, "warmup_ratio": 0.0, "warmup_steps": 0, "log_level": "info", "log_level_replica":
"warning", "log_on_each_node": true, "logging_dir": "test_trainer/runs/Jan07_14-10-47_ex-0",
"logging_strategy": "steps", "logging_first_step": false, "logging_steps":
500, "logging_nan_inf_filter": true, "save_strategy": "no", "save_steps":
500, "save_total_limit": null, "save_safetensors": true, "save_on_each_node":
false, "save_only_model": false, "no_cuda": false, "use_cpu": false,
"use_mps_device": false, "seed": 42, "data_seed": null, "jit_mode_eval":
false, "use_ipex": false, "bf16": false, "fp16": false, "fp16_opt_level":
"O1", "half_precision_backend": "auto", "bf16_full_eval": false, "fp16_full_eval":
false, "tf32": null, "local_rank": 0, "ddp_backend": null, "tpu_num_cores":
null, "tpu_metrics_debug": false, "debug": [], "dataloader_drop_last":
false, "eval_steps": null, "dataloader_num_workers": 0, "dataloader_prefetch_factor":
null, "past_index": -1, "run_name": "test_trainer", "disable_tqdm":
true, "remove_unused_columns": true, "label_names": null, "load_best_model_at_end":
false, "metric_for_best_model": null, "greater_is_better": null, "ignore_data_skip":
false, "fsdp": [], "fsdp_min_num_params": 0, "fsdp_config": {"min_num_params":
0, "xla": false, "xla_fsdp_v2": false, "xla_fsdp_grad_ckpt": false},
"fsdp_transformer_layer_cls_to_wrap": null, "accelerator_config":
{"split_batches": false, "dispatch_batches": null, "even_batches":
true, "use_seedable_sampler": true}, "deepspeed": null, "label_smoothing_factor":
0.0, "optim": "adamw_torch", "optim_args": null, "adafactor": false,
"group_by_length": false, "length_column_name": "length", "report_to":
[], "ddp_find_unused_parameters": null, "ddp_bucket_cap_mb": null,
"ddp_broadcast_buffers": null, "dataloader_pin_memory": true, "dataloader_persistent_workers":
false, "skip_memory_metrics": true, "use_legacy_prediction_loop":
false, "push_to_hub": false, "resume_from_checkpoint": null, "hub_model_id":
null, "hub_strategy": "every_save", "hub_token": "<HUB_TOKEN>", "hub_private_repo":
false, "hub_always_push": false, "gradient_checkpointing": false,
"gradient_checkpointing_kwargs": null, "include_inputs_for_metrics":
false, "fp16_backend": "auto", "push_to_hub_model_id": null, "push_to_hub_organization":
null, "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>", "mp_parameters":
"", "auto_find_batch_size": false, "full_determinism": false, "torchdynamo":
null, "ray_scope": "last", "ddp_timeout": 1800, "torch_compile": false,
"torch_compile_backend": null, "torch_compile_mode": null, "dispatch_batches":
null, "split_batches": null, "include_tokens_per_second": false, "include_num_input_tokens_seen":
false, "neftune_noise_alpha": null}'
image: docker.io/kubeflow/trainer-huggingface
name: pytorch
resources:
limits:
cpu: 20
memory: 20G
requests:
cpu: 20
memory: 20G
volumeMounts:
- mountPath: /workspace
name: storage-initializer
volumes:
- name: storage-initializer
persistentVolumeClaim:
claimName: storage-initializer
runPolicy:
suspend: false
  status:
    conditions:
    - lastTransitionTime: "2025-01-07T14:10:47Z"
      lastUpdateTime: "2025-01-07T14:10:47Z"
      message: PyTorchJob fine-tune-bert is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2025-01-07T14:11:16Z"
      lastUpdateTime: "2025-01-07T14:11:16Z"
      message: PyTorchJob fine-tune-bert is running.
      reason: PyTorchJobRunning
      status: "True"
      type: Running
    replicaStatuses:
      Master:
        active: 1
        selector: training.kubeflow.org/job-name=fine-tune-bert,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=master
      Worker:
        active: 1
        selector: training.kubeflow.org/job-name=fine-tune-bert,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker
    startTime: "2025-01-07T14:10:47Z"
kind: List
metadata:
  resourceVersion: ""
```
So, as you can see, no GPU has been allocated to your PyTorch pod:
```yaml
resources:
  limits:
    cpu: 20
    memory: 20G
  requests:
    cpu: 20
    memory: 20G
```
Locally on Kind using macOS, I was able to run the example on CPU using the docker.io/kubeflow/trainer-huggingface image.
Where do you run your Kubernetes cluster?
Yes, but it still has this error when I check with `kubectl logs fine-tune-bert-master-0 -n kubeflow-user-example-com`.
I ran this code in a Notebook in the Kubeflow UI.
Are you using a public cloud or on-prem to deploy the Kubeflow Control Plane?
@deepanker13 @helenxie-bit @johnugeorge @saileshd1402 Did you see these errors while running the train API on CPU-based instances?
I used Jarvice to create a Kubeflow instance.
Do you know which instances they run for the Kubernetes Nodes? E.g. are they AMD Linux machines with CPUs?
Sorry, I don't know.
@thuytrang32 Can you also try to set the use_cpu flag in the Trainer args?
```python
trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        ddp_backend="gloo",
        use_cpu=True,
    ),
)
```
When I ran with both ddp_backend and use_cpu, it still had the old error.
Then I tried running with use_cpu=True only, and the code passed. But when I checked with `kubectl logs fine-tune-bert-master-0 -n kubeflow-user-example-com`, it had these errors again.
For `kubectl describe pod fine-tune-bert-worker-0 -n kubeflow-user-example-com`: because the worker has a GPU, it didn't have the CUDA error, but it still had this error.
I ran the example from the tutorial (https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) on macOS with CPU only, and it worked as expected.
I guess the CUDA error occurs because the Trainer automatically detects the available device and tries to use CUDA if it's available. Can you try explicitly setting both the no_cuda and use_cpu flags, like this:
```python
trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        no_cuda=True,
        use_cpu=True,
    ),
)
```
Regarding the error `ValueError: Please specify 'target_modules' in 'peft_config'`: this likely occurs because the model you are using is not one of the standard architectures supported in PEFT (reference: https://github.com/huggingface/peft/issues/2128#issuecomment-2396633229). You will need to define the target_modules manually for your specific model. Here's a relevant discussion that might help: https://stackoverflow.com/questions/76768226/target-modules-for-applying-peft-lora-on-different-models.
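For example, a LoRA config with explicit target modules might look like the sketch below. The module names are assumptions based on common BERT-style layer naming, so verify them against your own checkpoint with `model.named_modules()` before relying on them:

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    bias="none",
    # Assumed names for the attention projections: BERT checkpoints
    # typically use "query"/"value", while DistilBERT uses "q_lin"/"v_lin".
    target_modules=["query", "value"],
)
```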
But target_modules is just a parameter of LoraConfig. Why does the error still exist even though I already tried not to use lora_config=LoraConfig(...)?
It didn't work even though I set both no_cuda=True and use_cpu=True.
Also, I saw that the size of the BERT model is only 1.3GB, so why is the 20GB of memory I set per worker still not enough?
> But target_modules is just a parameter of LoraConfig. Why does the error still exist even though I already tried not to use lora_config=LoraConfig(...)?
Oh I see. It seems that when lora_config is not explicitly set, the API assigns its default values and passes them into the container, as shown in the output of kubectl get pytorchjob -n <NAMESPACE> -o yaml:
```yaml
- --lora_config
- '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type":
  null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha":
  null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
  "modules_to_save": null, "init_lora_weights": true}'
```
As a result, the trainer still attempts to configure the PEFT model as indicated in the script: https://github.com/kubeflow/training-operator/blob/25c760c75673c93700de2ec5e10a95b5ad4e4b18/sdk/python/kubeflow/trainer/hf_llm_training.py#L118-L130
It seems that even without specifying lora_config, it is still included in the fine-tuning process. Could you try setting lora_config and explicitly defining target_modules to see if that resolves the issue?
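For reference, a guard like the following sketch would avoid that behavior (hypothetical code, not the actual hf_llm_training.py implementation):

```python
from typing import Optional

from peft import LoraConfig, get_peft_model


# Hypothetical guard: only wrap the model with PEFT when the user actually
# requested LoRA, instead of unconditionally applying a default LoraConfig
# whose target_modules is None.
def maybe_apply_lora(model, lora_config: Optional[LoraConfig]):
    if lora_config is None or lora_config.target_modules is None:
        return model  # fine-tune the base model as-is
    return get_peft_model(model, lora_config)
```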
Meanwhile, @andreyvelich do you think this might be a bug?
For the CUDA issue, sorry I don’t have a solution at the moment. It might be related to the base image used in the trainer. @andreyvelich @deepanker13 @johnugeorge @saileshd1402 Do you have any insights on this?
Regarding the memory issue, the trainer image is quite large, so please ensure the device has at least 10GB of available memory. It could be a memory constraint on the device you’re using. Could you confirm if the device meets this requirement?
I used another cluster without a GPU. The training process is now running, but it encountered a 'Target is out of bounds' error.
@andreyvelich @deepanker13 @johnugeorge @saileshd1402
> Meanwhile, @andreyvelich do you think this might be a bug?
Yes, I think we should fix it if users want to use the train API without LoRA.
cc @deepanker13 @saileshd1402 @johnugeorge
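A sketch of what such a fix could look like on the SDK side (hypothetical code with an assumed build_trainer_args helper, not the real train API internals): the --lora_config argument would only be forwarded when the user actually supplied a LoraConfig.

```python
import json
from dataclasses import asdict

# Hypothetical sketch: build the trainer container args so that
# --lora_config is omitted entirely when no LoraConfig was given,
# rather than serializing a default one.
def build_trainer_args(trainer_parameters):
    args = [
        "--training_parameters",
        json.dumps(trainer_parameters.training_parameters.to_dict()),
    ]
    if trainer_parameters.lora_config is not None:
        args += ["--lora_config", json.dumps(asdict(trainer_parameters.lora_config))]
    return args
```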
@thuytrang32 Can you please try the CPU image for your Trainer? You can use an image that I built locally:
export TRAINER_TRANSFORMER_IMAGE=docker.io/andreyvelichkevich/llm-trainer-cpu
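If you are running from a notebook rather than a shell, the equivalent would be setting the variable in Python, before importing the SDK in case it is read at import time (a sketch, assuming the SDK picks the image up from this environment variable as the export above implies):

```python
import os

# Set before the kubeflow.training import so the SDK can pick it up.
os.environ["TRAINER_TRANSFORMER_IMAGE"] = "docker.io/andreyvelichkevich/llm-trainer-cpu"

from kubeflow.training import TrainingClient
```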
@thuytrang32 Can you please share the results after using a smaller num_procs_per_worker (maybe 2 for now), reducing cpu in resources_per_worker to 8, setting ddp_backend="gloo", and removing the gpu key from resources? Maybe it is a resource issue. A sketch of the adjusted call follows below.
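Concretely, combining those suggestions with the use_cpu flag from earlier would look roughly like this (reusing the imports from the snippets above; the reduced values are the ones proposed, not verified settings):

```python
TrainingClient().train(
    name="fine-tune-bert",
    storage_config={"size": "5Gi", "storage_class": "nfs-client"},
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://distilbert/distilbert-base-uncased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
            ddp_backend="gloo",  # CPU-friendly collective backend
            use_cpu=True,
        ),
    ),
    num_workers=2,
    num_procs_per_worker=2,  # reduced from 20
    resources_per_worker={
        "cpu": 8,  # reduced from 20; no "gpu" key
        "memory": "20G",
    },
)
```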
Hi @deepanker13 @andreyvelich @helenxie-bit, I used a new cluster without a GPU and tried to run the code again, and it worked well.
I noticed that the HuggingFaceModelParams class in the Kubeflow pipeline currently supports only specific transformer model types, such as sequence classification, token classification, question answering, causal language modeling, masked language modeling, and image classification.
However, I am working on different tasks. Could you confirm whether there are any plans to support additional transformer types, such as AutoModelForVision2Seq or similar models for image captioning?
Thank you for your assistance.
That is great to hear, @thuytrang32! We are designing a new LLM Trainer as part of the Kubeflow Trainer V2 effort: https://github.com/kubeflow/training-operator/issues/2401. cc @Electronic-Waste We are planning to design LLM runtimes with a pre-created Trainer whose parameters you can override.
It would be nice if you could explain your use case in the doc or in the issue, so we can keep it in mind.
Hi @andreyvelich,
I don't have a specific production use case at the moment; my primary goal is to test the Kubeflow pipeline's capacity to fine-tune generative AI models across different transformer types.
I have successfully tested your pipeline with a BERT model for text classification. However, I encountered an issue when trying to test the pipeline with AutoModelForCausalLM using the openwebtext and bookcorpus datasets: the HuggingFaceDatasetParams class does not support the trust_remote_code=True parameter, which is required to download certain datasets. I also tested with the wikitext dataset, and it didn't work either, failing with the error "Config name is missing".
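For context, this is roughly what the underlying Hugging Face datasets calls need for those datasets (a sketch with the plain datasets API, illustrating the parameters that HuggingFaceDatasetParams currently cannot forward):

```python
from datasets import load_dataset

# wikitext requires a config name as the second positional argument;
# omitting it produces the "Config name is missing" error.
wikitext = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")

# openwebtext and bookcorpus rely on dataset loading scripts, which
# recent datasets releases only run with trust_remote_code=True.
openwebtext = load_dataset("openwebtext", split="train", trust_remote_code=True)
```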
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
@Electronic-Waste: Reopened this issue.
In response to this:
/reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @thuytrang32! Please try to use the V2 examples for fine-tuning: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/question-answering/fine-tune-distilbert.ipynb It should have much more stable orchestration. /close
@andreyvelich: Closing this issue.
In response to this:
Hi @thuytrang32! Please try to use the V2 examples for fine-tuning: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/question-answering/fine-tune-distilbert.ipynb It should have much more stable orchestration. /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.