training-operator Cannot fine-tune LLM without GPU - CUDA error and DDP initialization

What happened?

I am trying to fine-tune an LLM using Kubeflow without GPU devices. However, I ran into two issues during the process:

When I removed the gpu key from resources_per_worker, the training job still tried to use GPUs and failed with CUDA error: invalid device ordinal.
To address this, I added ddp_backend="gloo" to training_parameters. However, this led to another error.

I followed the instructions at https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/. This is the code I ran:

import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    storage_config={
        "size": "5Gi",
        "storage_class": "nfs-client",
    },
    # BERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://google-bert/bert-base-cased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use the first 100 samples from the Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we will skip evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
            ddp_backend="gloo",
        ),
        # Set LoRA config to reduce number of trainable model parameters.
        lora_config=LoraConfig(
            r=8,
            lora_alpha=8,
            lora_dropout=0.1,
            bias="none",
        ),
    ),
    num_workers=2,  # nnodes parameter for torchrun command.
    num_procs_per_worker=16,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        # "gpu": 0,
        "cpu": 16,
        "memory": "16G",
    },
)

What did you expect to happen?

The training job should correctly initialize without attempting to allocate GPUs.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.29.6+k3s2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.6+k3s2

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"
kubeflow/training-operator:latest

Training Operator Python SDK version:

$ pip show kubeflow-training
Name: kubeflow-training
Version: 1.8.1
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by:

Impacted by this bug?

👍

Jan 07 '25 10:01 thuytrang32

I also get the error [rank0]: ValueError: Please specify target_modules in peft_config. I tried removing the LoRA config, but the error persists.

import transformers
from peft import LoraConfig

from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceTrainerParams,
    HuggingFaceDatasetParams,
)

TrainingClient().train(
    name="fine-tune-bert",
    storage_config={
        "size": "5Gi",
        "storage_class": "nfs-client",
    },
    # DistilBERT model URI and type of Transformer to train it.
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://distilbert/distilbert-base-uncased",
        transformer_type=transformers.AutoModelForSequenceClassification,
    ),
    # Use the first 100 samples from the Yelp dataset.
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="yelp_review_full",
        split="train[:100]",
    ),
    # Specify HuggingFace Trainer parameters. In this example, we will skip evaluation and model checkpoints.
    trainer_parameters=HuggingFaceTrainerParams(
        training_parameters=transformers.TrainingArguments(
            output_dir="test_trainer",
            save_strategy="no",
            evaluation_strategy="no",
            do_eval=False,
            disable_tqdm=True,
            log_level="info",
            # ddp_backend="gloo",
        ),
        # LoRA config removed for this run:
        # lora_config=LoraConfig(
        #     r=8,
        #     lora_alpha=8,
        #     lora_dropout=0.1,
        #     bias="none",
        #     target_modules=["encoder.layer.*.attention.self.query", "encoder.layer.*.attention.self.key"],
        # ),
    ),
    num_workers=2,  # nnodes parameter for torchrun command.
    num_procs_per_worker=20,  # nproc-per-node parameter for torchrun command.
    resources_per_worker={
        "cpu": 20,
        "memory": "20G",
    },
)

Jan 07 '25 14:01 thuytrang32

Thank you for creating this! For the first error, can you please check the PyTorchJob? It should be created without GPU resources.

kubectl get pytorchjob -n <NAMESPACE> -o yaml

Jan 07 '25 14:01 andreyvelich

Hi, this is the output:

(base) jovyan@ex-0:~$ kubectl get pytorchjobs -n kubeflow-user-example-com -o yaml
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1
  kind: PyTorchJob
  metadata:
    creationTimestamp: "2025-01-07T14:10:47Z"
    generation: 1
    name: fine-tune-bert
    namespace: kubeflow-user-example-com
    resourceVersion: "580563"
    uid: 72ed106c-5299-4de3-9f27-8e5464d4e59b
  spec:
    nprocPerNode: "20"
    pytorchReplicaSpecs:
      Master:
        replicas: 1
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - args:
              - --model_uri
              - hf://distilbert/distilbert-base-uncased
              - --transformer_type
              - AutoModelForSequenceClassification
              - --num_labels
              - None
              - --model_dir
              - /workspace/model
              - --dataset_dir
              - /workspace/dataset
              - --lora_config
              - '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type": null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha": null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none", "modules_to_save": null, "init_lora_weights": true}'
              - --training_parameters
              - '{"output_dir": "test_trainer", "overwrite_output_dir": false, "do_train": false, "do_eval": false, "do_predict": false, "evaluation_strategy": "no", "prediction_loss_only": false, "per_device_train_batch_size": 8, "per_device_eval_batch_size": 8, "per_gpu_train_batch_size": null, "per_gpu_eval_batch_size": null, "gradient_accumulation_steps": 1, "eval_accumulation_steps": null, "eval_delay": 0, "learning_rate": 5e-05, "weight_decay": 0.0, "adam_beta1": 0.9, "adam_beta2": 0.999, "adam_epsilon": 1e-08, "max_grad_norm": 1.0, "num_train_epochs": 3.0, "max_steps": -1, "lr_scheduler_type": "linear", "lr_scheduler_kwargs": {}, "warmup_ratio": 0.0, "warmup_steps": 0, "log_level": "info", "log_level_replica": "warning", "log_on_each_node": true, "logging_dir": "test_trainer/runs/Jan07_14-10-47_ex-0", "logging_strategy": "steps", "logging_first_step": false, "logging_steps": 500, "logging_nan_inf_filter": true, "save_strategy": "no", "save_steps": 500, "save_total_limit": null, "save_safetensors": true, "save_on_each_node": false, "save_only_model": false, "no_cuda": false, "use_cpu": false, "use_mps_device": false, "seed": 42, "data_seed": null, "jit_mode_eval": false, "use_ipex": false, "bf16": false, "fp16": false, "fp16_opt_level": "O1", "half_precision_backend": "auto", "bf16_full_eval": false, "fp16_full_eval": false, "tf32": null, "local_rank": 0, "ddp_backend": null, "tpu_num_cores": null, "tpu_metrics_debug": false, "debug": [], "dataloader_drop_last": false, "eval_steps": null, "dataloader_num_workers": 0, "dataloader_prefetch_factor": null, "past_index": -1, "run_name": "test_trainer", "disable_tqdm": true, "remove_unused_columns": true, "label_names": null, "load_best_model_at_end": false, "metric_for_best_model": null, "greater_is_better": null, "ignore_data_skip": false, "fsdp": [], "fsdp_min_num_params": 0, "fsdp_config": {"min_num_params": 0, "xla": false, "xla_fsdp_v2": false, "xla_fsdp_grad_ckpt": false}, "fsdp_transformer_layer_cls_to_wrap": null, "accelerator_config": {"split_batches": false, "dispatch_batches": null, "even_batches": true, "use_seedable_sampler": true}, "deepspeed": null, "label_smoothing_factor": 0.0, "optim": "adamw_torch", "optim_args": null, "adafactor": false, "group_by_length": false, "length_column_name": "length", "report_to": [], "ddp_find_unused_parameters": null, "ddp_bucket_cap_mb": null, "ddp_broadcast_buffers": null, "dataloader_pin_memory": true, "dataloader_persistent_workers": false, "skip_memory_metrics": true, "use_legacy_prediction_loop": false, "push_to_hub": false, "resume_from_checkpoint": null, "hub_model_id": null, "hub_strategy": "every_save", "hub_token": "<HUB_TOKEN>", "hub_private_repo": false, "hub_always_push": false, "gradient_checkpointing": false, "gradient_checkpointing_kwargs": null, "include_inputs_for_metrics": false, "fp16_backend": "auto", "push_to_hub_model_id": null, "push_to_hub_organization": null, "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>", "mp_parameters": "", "auto_find_batch_size": false, "full_determinism": false, "torchdynamo": null, "ray_scope": "last", "ddp_timeout": 1800, "torch_compile": false, "torch_compile_backend": null, "torch_compile_mode": null, "dispatch_batches": null, "split_batches": null, "include_tokens_per_second": false, "include_num_input_tokens_seen": false, "neftune_noise_alpha": null}'
              image: docker.io/kubeflow/trainer-huggingface
              name: pytorch
              resources:
                limits:
                  cpu: 20
                  memory: 20G
                requests:
                  cpu: 20
                  memory: 20G
              volumeMounts:
              - mountPath: /workspace
                name: storage-initializer
            initContainers:
            - args:
              - --model_provider
              - hf
              - --model_provider_parameters
              - '{"model_uri": "hf://distilbert/distilbert-base-uncased", "transformer_type": "AutoModelForSequenceClassification", "access_token": null, "num_labels": null}'
              - --dataset_provider
              - hf
              - --dataset_provider_parameters
              - '{"repo_id": "yelp_review_full", "access_token": null, "split": "train[:100]"}'
              image: docker.io/kubeflow/storage-initializer
              name: storage-initializer
              volumeMounts:
              - mountPath: /workspace
                name: storage-initializer
            volumes:
            - name: storage-initializer
              persistentVolumeClaim:
                claimName: storage-initializer
      Worker:
        replicas: 1
        template:
          metadata:
            annotations:
              sidecar.istio.io/inject: "false"
          spec:
            containers:
            - args:
              - --model_uri
              - hf://distilbert/distilbert-base-uncased
              - --transformer_type
              - AutoModelForSequenceClassification
              - --num_labels
              - None
              - --model_dir
              - /workspace/model
              - --dataset_dir
              - /workspace/dataset
              - --lora_config
              - '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type": null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha": null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none", "modules_to_save": null, "init_lora_weights": true}'
              - --training_parameters
              - '{"output_dir": "test_trainer", "overwrite_output_dir": false, "do_train": false, "do_eval": false, "do_predict": false, "evaluation_strategy": "no", "prediction_loss_only": false, "per_device_train_batch_size": 8, "per_device_eval_batch_size": 8, "per_gpu_train_batch_size": null, "per_gpu_eval_batch_size": null, "gradient_accumulation_steps": 1, "eval_accumulation_steps": null, "eval_delay": 0, "learning_rate": 5e-05, "weight_decay": 0.0, "adam_beta1": 0.9, "adam_beta2": 0.999, "adam_epsilon": 1e-08, "max_grad_norm": 1.0, "num_train_epochs": 3.0, "max_steps": -1, "lr_scheduler_type": "linear", "lr_scheduler_kwargs": {}, "warmup_ratio": 0.0, "warmup_steps": 0, "log_level": "info", "log_level_replica": "warning", "log_on_each_node": true, "logging_dir": "test_trainer/runs/Jan07_14-10-47_ex-0", "logging_strategy": "steps", "logging_first_step": false, "logging_steps": 500, "logging_nan_inf_filter": true, "save_strategy": "no", "save_steps": 500, "save_total_limit": null, "save_safetensors": true, "save_on_each_node": false, "save_only_model": false, "no_cuda": false, "use_cpu": false, "use_mps_device": false, "seed": 42, "data_seed": null, "jit_mode_eval": false, "use_ipex": false, "bf16": false, "fp16": false, "fp16_opt_level": "O1", "half_precision_backend": "auto", "bf16_full_eval": false, "fp16_full_eval": false, "tf32": null, "local_rank": 0, "ddp_backend": null, "tpu_num_cores": null, "tpu_metrics_debug": false, "debug": [], "dataloader_drop_last": false, "eval_steps": null, "dataloader_num_workers": 0, "dataloader_prefetch_factor": null, "past_index": -1, "run_name": "test_trainer", "disable_tqdm": true, "remove_unused_columns": true, "label_names": null, "load_best_model_at_end": false, "metric_for_best_model": null, "greater_is_better": null, "ignore_data_skip": false, "fsdp": [], "fsdp_min_num_params": 0, "fsdp_config": {"min_num_params": 0, "xla": false, "xla_fsdp_v2": false, "xla_fsdp_grad_ckpt": false}, "fsdp_transformer_layer_cls_to_wrap": null, "accelerator_config": {"split_batches": false, "dispatch_batches": null, "even_batches": true, "use_seedable_sampler": true}, "deepspeed": null, "label_smoothing_factor": 0.0, "optim": "adamw_torch", "optim_args": null, "adafactor": false, "group_by_length": false, "length_column_name": "length", "report_to": [], "ddp_find_unused_parameters": null, "ddp_bucket_cap_mb": null, "ddp_broadcast_buffers": null, "dataloader_pin_memory": true, "dataloader_persistent_workers": false, "skip_memory_metrics": true, "use_legacy_prediction_loop": false, "push_to_hub": false, "resume_from_checkpoint": null, "hub_model_id": null, "hub_strategy": "every_save", "hub_token": "<HUB_TOKEN>", "hub_private_repo": false, "hub_always_push": false, "gradient_checkpointing": false, "gradient_checkpointing_kwargs": null, "include_inputs_for_metrics": false, "fp16_backend": "auto", "push_to_hub_model_id": null, "push_to_hub_organization": null, "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>", "mp_parameters": "", "auto_find_batch_size": false, "full_determinism": false, "torchdynamo": null, "ray_scope": "last", "ddp_timeout": 1800, "torch_compile": false, "torch_compile_backend": null, "torch_compile_mode": null, "dispatch_batches": null, "split_batches": null, "include_tokens_per_second": false, "include_num_input_tokens_seen": false, "neftune_noise_alpha": null}'
              image: docker.io/kubeflow/trainer-huggingface
              name: pytorch
              resources:
                limits:
                  cpu: 20
                  memory: 20G
                requests:
                  cpu: 20
                  memory: 20G
              volumeMounts:
              - mountPath: /workspace
                name: storage-initializer
            volumes:
            - name: storage-initializer
              persistentVolumeClaim:
                claimName: storage-initializer
    runPolicy:
      suspend: false
  status:
    conditions:
    - lastTransitionTime: "2025-01-07T14:10:47Z"
      lastUpdateTime: "2025-01-07T14:10:47Z"
      message: PyTorchJob fine-tune-bert is created.
      reason: PyTorchJobCreated
      status: "True"
      type: Created
    - lastTransitionTime: "2025-01-07T14:11:16Z"
      lastUpdateTime: "2025-01-07T14:11:16Z"
      message: PyTorchJob fine-tune-bert is running.
      reason: PyTorchJobRunning
      status: "True"
      type: Running
    replicaStatuses:
      Master:
        active: 1
        selector: training.kubeflow.org/job-name=fine-tune-bert,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=master
      Worker:
        active: 1
        selector: training.kubeflow.org/job-name=fine-tune-bert,training.kubeflow.org/operator-name=pytorchjob-controller,training.kubeflow.org/replica-type=worker
    startTime: "2025-01-07T14:10:47Z"
kind: List
metadata:
  resourceVersion: ""
(base) jovyan@ex-0:~$

Jan 07 '25 14:01 thuytrang32

So, as you can see, no GPU has been allocated to your PyTorch pods:

resources:
  limits:
    cpu: 20
    memory: 20G
  requests:
    cpu: 20
    memory: 20G
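
The same check can be scripted. This is a minimal sketch, assuming the job has already been loaded into a Python dict (for example from the -o json form of the kubectl command). The helper name gpu_resources is hypothetical, and nvidia.com/gpu is the resource name that the NVIDIA device plugin normally registers. An empty result only shows that Kubernetes was not asked for a GPU; depending on the node's container runtime, a container may still be able to see one.

```python
def gpu_resources(pytorchjob: dict) -> dict:
    """Return {replica: {"<section>.<key>": value}} for GPU-like resource keys."""
    found = {}
    replica_specs = pytorchjob.get("spec", {}).get("pytorchReplicaSpecs", {})
    for replica, spec in replica_specs.items():
        pod_spec = spec.get("template", {}).get("spec", {})
        for container in pod_spec.get("containers", []):
            resources = container.get("resources", {})
            for section in ("limits", "requests"):
                for key, value in resources.get(section, {}).items():
                    if "gpu" in key.lower():
                        found.setdefault(replica, {})[f"{section}.{key}"] = value
    return found


# Resources copied from the job above; all other fields are omitted for brevity.
cpu_only = {
    "limits": {"cpu": 20, "memory": "20G"},
    "requests": {"cpu": 20, "memory": "20G"},
}
job = {
    "spec": {
        "pytorchReplicaSpecs": {
            name: {"template": {"spec": {"containers": [{"name": "pytorch", "resources": cpu_only}]}}}
            for name in ("Master", "Worker")
        }
    }
}

print(gpu_resources(job))  # {} -> no GPU requested for either replica
```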

Locally, on Kind on macOS, I was able to run the example on CPU using the docker.io/kubeflow/trainer-huggingface image.

Where do you run your Kubernetes cluster?

Jan 07 '25 14:01 andreyvelich

Yes, but I still get this error when I check with kubectl logs fine-tune-bert-master-0 -n kubeflow-user-example-com.

I ran this code in a notebook in the Kubeflow UI.

Jan 07 '25 15:01 thuytrang32

Are you using a public cloud or on-prem to deploy the Kubeflow control plane? @deepanker13 @helenxie-bit @johnugeorge @saileshd1402 Did you see these errors while running the train API on CPU-based instances?

Jan 07 '25 15:01 andreyvelich

I used Jarvice to create a Kubeflow instance.

Jan 07 '25 15:01 thuytrang32

Do you know which instances they run for the Kubernetes nodes? E.g., are they AMD Linux machines with CPUs?

Jan 07 '25 15:01 andreyvelich

Sorry, I don't know.

Jan 07 '25 15:01 thuytrang32

@thuytrang32 Can you also try setting the use_cpu flag in the Trainer args?

trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        ddp_backend="gloo",
        use_cpu=True,
    ),
)

Jan 07 '25 15:01 andreyvelich

When I ran with both ddp_backend and use_cpu, I still got the old error.

Then I ran with use_cpu=True only, and the code passed. But when I checked with kubectl logs fine-tune-bert-master-0 -n kubeflow-user-example-com, the same errors appeared again.

For kubectl describe pod fine-tune-bert-worker-0 -n kubeflow-user-example-com: because the worker has a GPU, it did not have the CUDA error, but it still had this.

Jan 07 '25 15:01 thuytrang32

I ran the example from the tutorial (https://www.kubeflow.org/docs/components/training/user-guides/fine-tuning/) on macOS with CPU only, and it worked as expected.

I guess the CUDA error occurs because the trainer automatically detects the available device and tries to use CUDA if it is available. Can you try explicitly setting both the no_cuda and use_cpu flags like this:

trainer_parameters=HuggingFaceTrainerParams(
    training_parameters=transformers.TrainingArguments(
        output_dir="test_trainer",
        save_strategy="no",
        evaluation_strategy="no",
        do_eval=False,
        disable_tqdm=True,
        log_level="info",
        no_cuda=True,
        use_cpu=True,
    ),
)
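
One possible reading of the invalid device ordinal message, sketched here under assumptions: torchrun starts num_procs_per_worker processes per pod and gives each one a LOCAL_RANK, and a GPU-enabled trainer commonly binds process N to cuda:N. If the container can see fewer GPUs than there are processes, the higher ranks ask for a device that does not exist. The helper below is hypothetical and only does the arithmetic; it does not call PyTorch, and nothing in this thread confirms that this is what happened here.

```python
def invalid_local_ranks(nproc_per_node: int, visible_gpus: int) -> list[int]:
    """Local ranks that would request a CUDA device index that does not exist.

    Assumption: each process binds to cuda:<LOCAL_RANK> whenever at least one
    GPU is visible, and falls back to CPU when none is visible.
    """
    if visible_gpus == 0:
        return []  # no CUDA device visible, so no process tries to bind to one
    return [rank for rank in range(nproc_per_node) if rank >= visible_gpus]


print(invalid_local_ranks(4, 1))  # [1, 2, 3]
print(invalid_local_ranks(4, 0))  # []
```

Under that assumption, keeping num_procs_per_worker at or below the number of visible GPUs, or hiding the GPUs entirely, would avoid the mismatch.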

Regarding the error ValueError: Please specify 'target_modules' in 'peft_config': this likely occurs because the model you are using is not one of the standard architectures supported in PEFT (Reference: https://github.com/huggingface/peft/issues/2128#issuecomment-2396633229). You will need to define the target_modules manually for your specific model. Here's a relevant discussion that might help: https://stackoverflow.com/questions/76768226/target-modules-for-applying-peft-lora-on-different-models.

Jan 07 '25 17:01 helenxie-bit

But target_modules is just a parameter of LoraConfig. Why does the error still occur even when I don't pass lora_config=LoraConfig(...)?

Jan 07 '25 18:01 thuytrang32

It didn't work even though I set both no_cuda=True and use_cpu=True.

Also, I saw that the BERT model is only 1.3 GB. Why is 20 GB of memory per worker still not enough?

Jan 07 '25 18:01 thuytrang32

But target_modules is just a parameter of LoraConfig. Why does the error still occur even when I don't pass lora_config=LoraConfig(...)?

Oh I see. It seems that when lora_config is not explicitly set, the API assigns its default values and passes them into the container, as shown in the output of kubectl get pytorchjob -n <NAMESPACE> -o yaml:

- --lora_config
- '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type":
null, "inference_mode": false, "r": 8, "target_modules": null, "lora_alpha":
null, "lora_dropout": null, "fan_in_fan_out": false, "bias": "none",
"modules_to_save": null, "init_lora_weights": true}'

As a result, the trainer still attempts to configure the PEFT model as indicated in the script: https://github.com/kubeflow/training-operator/blob/25c760c75673c93700de2ec5e10a95b5ad4e4b18/sdk/python/kubeflow/trainer/hf_llm_training.py#L118-L130

It seems that even without specifying lora_config, it is still included in the fine-tuning process. Could you try setting lora_config and explicitly defining target_modules to see if that resolves the issue?
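
The null target_modules in the serialized argument above can be confirmed with the standard library alone. This is a small sketch; unset_fields is a hypothetical helper, and the JSON string is copied from the job spec.

```python
import json

# --lora_config argument copied from the PyTorchJob spec shown above.
LORA_CONFIG_ARG = (
    '{"peft_type": "LORA", "base_model_name_or_path": null, "task_type": null, '
    '"inference_mode": false, "r": 8, "target_modules": null, "lora_alpha": null, '
    '"lora_dropout": null, "fan_in_fan_out": false, "bias": "none", '
    '"modules_to_save": null, "init_lora_weights": true}'
)


def unset_fields(raw_json: str, fields=("target_modules",)) -> list[str]:
    """Return the names in `fields` that are missing or null in the config."""
    config = json.loads(raw_json)
    return [name for name in fields if config.get(name) is None]


print(unset_fields(LORA_CONFIG_ARG))  # ['target_modules']
```

When setting target_modules explicitly, note that the attention projections in DistilBERT are named q_lin, k_lin and v_lin, while BERT uses query, key and value. PEFT matches list entries against the end of each module name, so wildcard paths such as encoder.layer.*.attention.self.query inside a list are unlikely to match anything.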

Meanwhile, @andreyvelich do you think this might be a bug?

Jan 07 '25 23:01 helenxie-bit

For the CUDA issue, sorry I don’t have a solution at the moment. It might be related to the base image used in the trainer. @andreyvelich @deepanker13 @johnugeorge @saileshd1402 Do you have any insights on this?
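
One image-independent way to force CPU execution is to hide the GPUs from the process by setting CUDA_VISIBLE_DEVICES to an empty string before CUDA is initialized. The sketch below only builds such an environment; hide_gpus is a hypothetical helper, and how to inject the variable into the trainer container is not established in this thread.

```python
import os


def hide_gpus(base_env=None) -> dict:
    """Return a copy of the environment in which no CUDA device is visible."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ""  # an empty value exposes no devices to CUDA
    return env


env = hide_gpus({"PATH": "/usr/bin"})
print(env)  # {'PATH': '/usr/bin', 'CUDA_VISIBLE_DEVICES': ''}
```

The resulting dict can be passed as env= to subprocess.run when launching a training process locally.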

Regarding the memory issue, the trainer image is quite large, so please ensure the device has at least 10GB of available memory. It could be a memory constraint on the device you’re using. Could you confirm if the device meets this requirement?

Jan 07 '25 23:01 helenxie-bit

I used another cluster without a GPU. The training process is now running, but it encountered a 'Target is out of bounds' error.

@andreyvelich @deepanker13 @johnugeorge @saileshd1402
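
A likely reading of 'Target is out of bounds', offered as an assumption rather than a confirmed diagnosis: yelp_review_full has five classes (labels 0 to 4), while the job spec above passes --num_labels None, so the classification head falls back to the Transformers default of two labels. Labels 2, 3 and 4 then fall outside the range the head can predict. The check below is a hypothetical helper in plain Python.

```python
def out_of_range_labels(labels, num_labels: int) -> list[int]:
    """Distinct labels that a classifier with `num_labels` outputs cannot represent."""
    return sorted({label for label in labels if not 0 <= label < num_labels})


yelp_labels = [0, 1, 2, 3, 4]  # star ratings 1 to 5, stored as 0 to 4

print(out_of_range_labels(yelp_labels, num_labels=2))  # [2, 3, 4]
print(out_of_range_labels(yelp_labels, num_labels=5))  # []
```

If this is the cause, setting num_labels=5 in HuggingFaceModelParams (the field appears in the serialized model_provider_parameters above) should avoid the error.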

Jan 10 '25 12:01 thuytrang32

Meanwhile, @andreyvelich do you think this might be a bug?

Yes, I think we should fix it, since users may want to use the train API without LoRA. cc @deepanker13 @saileshd1402 @johnugeorge

Jan 10 '25 18:01 andreyvelich

@thuytrang32 Can you please try the CPU image for your Trainer? You can use the image that I built locally:

export TRAINER_TRANSFORMER_IMAGE=docker.io/andreyvelichkevich/llm-trainer-cpu

Jan 10 '25 18:01 andreyvelich

@thuytrang32 Can you please share the results after lowering num_procs_per_worker (maybe to 2 for now), reducing cpu in resources_per_worker to 8, setting ddp_backend="gloo", and removing the gpu key from resources_per_worker? It may be a resource issue.
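
The effect of these settings on per-process resources can be estimated with simple arithmetic. This is a sketch, assuming resources_per_worker is shared evenly by the num_procs_per_worker processes that torchrun starts in each pod, and that each process holds its own copy of the model (as is usual for DDP). per_process_budget is a hypothetical helper, and the second call assumes the memory limit stays at 20G.

```python
def per_process_budget(cpu: int, memory_gb: float, num_procs_per_worker: int):
    """Return (CPU cores, memory in GB) available to each torchrun process."""
    return cpu / num_procs_per_worker, memory_gb / num_procs_per_worker


# Settings from the failing run: 20 CPUs and 20G shared by 20 processes.
print(per_process_budget(cpu=20, memory_gb=20, num_procs_per_worker=20))  # (1.0, 1.0)

# Suggested settings: 8 CPUs shared by 2 processes.
print(per_process_budget(cpu=8, memory_gb=20, num_procs_per_worker=2))  # (4.0, 10.0)
```

With 20 processes sharing 20G, each process has about 1 GB for the Python runtime, PyTorch, the model and the data, which may explain why the limit is reached even though the checkpoint itself is small.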

Jan 13 '25 07:01 deepanker13

Hi @deepanker13 @andreyvelich @helenxie-bit, I used a new cluster without a GPU and ran the code again, and it worked well.

I noticed that the HuggingFaceModelParams class in the Kubeflow pipeline currently supports only specific transformer model types: sequence classification, token classification, question answering, causal language modeling, masked language modeling, and image classification.

However, I am working on different tasks. Could you confirm whether there are plans to support additional transformer types, such as AutoModelForVision2Seq or similar models for image captioning?

Thank you for your assistance.

Jan 27 '25 11:01 thuytrang32

That is great to hear @thuytrang32! We are designing a new LLM Trainer as part of the Kubeflow Trainer V2 effort: https://github.com/kubeflow/training-operator/issues/2401. cc @Electronic-Waste We are planning to design LLM runtimes with a pre-created Trainer for which you can override parameters.

It would be nice if you could explain your use-case in the doc or in the issue, so we can keep it in mind.

Jan 27 '25 12:01 andreyvelich

Hi @andreyvelich ,

I don't have a specific production use case at the moment. My primary goal is to test the Kubeflow pipeline's capacity to fine-tune generative AI models across different transformer types.

I have successfully tested your pipeline with a BERT model for text classification. However, I ran into an issue when testing the pipeline with AutoModelForCausalLM on the openwebtext and bookcorpus datasets: the HuggingFaceDatasetParams class does not support the trust_remote_code=True parameter, which is required to download certain datasets. I also tested with the wikitext dataset, and it failed as well, with the error "Config name is missing".

[Screenshots: wikitext, openwebtext, bookcorpus]
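
For reference, both failures correspond to arguments of datasets.load_dataset in the Hugging Face datasets library: the config is passed as name, and remote dataset scripts are allowed with trust_remote_code (accepted by some datasets releases; newer ones may no longer support script-based datasets). The sketch below is hypothetical: load_dataset_kwargs is not part of the Kubeflow SDK, it only assembles the keyword arguments, and wikitext-2-raw-v1 is one of the configs published for wikitext.

```python
def load_dataset_kwargs(repo_id, split, config_name=None, trust_remote_code=False) -> dict:
    """Build keyword arguments for datasets.load_dataset (assumed usage)."""
    kwargs = {"path": repo_id, "split": split}
    if config_name is not None:
        kwargs["name"] = config_name  # required for multi-config datasets such as wikitext
    if trust_remote_code:
        kwargs["trust_remote_code"] = True  # only meaningful where the library accepts it
    return kwargs


print(load_dataset_kwargs("wikitext", "train[:100]", config_name="wikitext-2-raw-v1"))
# {'path': 'wikitext', 'split': 'train[:100]', 'name': 'wikitext-2-raw-v1'}
```

The result would be passed on as load_dataset(**kwargs) in an environment where the datasets package is installed.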

Jan 27 '25 13:01 thuytrang32

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Apr 27 '25 15:04 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

May 17 '25 20:05 github-actions[bot]

/reopen

May 18 '25 03:05 Electronic-Waste

@Electronic-Waste: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

May 18 '25 03:05 google-oss-prow[bot]

Hi @thuytrang32! Please try to use the V2 examples for fine-tuning: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/question-answering/fine-tune-distilbert.ipynb It should have much more stable orchestration. /close

Aug 10 '25 22:08 andreyvelich

@andreyvelich: Closing this issue.

In response to this:

Hi @thuytrang32! Please try to use the V2 examples for fine-tuning: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/question-answering/fine-tune-distilbert.ipynb It should have much more stable orchestration. /close


Aug 10 '25 22:08 google-oss-prow[bot]