[BUG] 'autotrain llm' does not save checkpoints in the project folder

Open Mobinapournemat opened this issue 4 months ago • 5 comments

Prerequisites

  • [X] I have read the documentation.
  • [X] I have checked other issues for similar problems.

Backend

Local

Interface Used

UI

CLI Command

autotrain llm --train --project-name ftLlama --model meta-llama/Llama-2-7b-hf --data-path . --peft --lr 2e-4 --batch-size 12 --epochs 5 --trainer sft

UI Screenshots & Parameters

No response

Error Logs

Checkpoints are not saved

Additional Information

I'm fine-tuning "meta-llama/Llama-2-7b-hf" on a remote GPU using autotrain llm, but after training completes, the checkpoints are not saved. The project folder contains files like autotrain-data, adapter-config, tokenizer, training-params, etc., but the checkpoints are missing.

Here is the content of the training-params file:

{
  "model": "meta-llama/Llama-2-7b-hf",
  "project_name": "ftLlama",
  "data_path": "ftLlama/autotrain-data",
  "train_split": "train",
  "valid_split": null,
  "add_eos_token": true,
  "block_size": 1024,
  "model_max_length": 2048,
  "padding": "right",
  "trainer": "sft",
  "use_flash_attention_2": false,
  "log": "none",
  "disable_gradient_checkpointing": false,
  "logging_steps": -1,
  "eval_strategy": "epoch",
  "save_total_limit": 1,
  "auto_find_batch_size": false,
  "mixed_precision": null,
  "lr": 0.0002,
  "epochs": 5,
  "batch_size": 12,
  "warmup_ratio": 0.1,
  "gradient_accumulation": 4,
  "optimizer": "adamw_torch",
  "scheduler": "linear",
  "weight_decay": 0.0,
  "max_grad_norm": 1.0,
  "seed": 42,
  "chat_template": null,
  "quantization": "int4",
  "target_modules": "all-linear",
  "merge_adapter": false,
  "peft": true,
  "lora_r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "model_ref": null,
  "dpo_beta": 0.1,
  "max_prompt_length": 128,
  "max_completion_length": null,
  "prompt_text_column": "autotrain_prompt",
  "text_column": "autotrain_text",
  "rejected_text_column": "autotrain_rejected_text",
  "push_to_hub": false,
  "unsloth": false,
  "distributed_backend": null
}

Am I missing something, or is this a bug? This is the command I'm using:

autotrain llm --train --project-name ftLlama --model meta-llama/Llama-2-7b-hf --data-path . --peft --lr 2e-4 --batch-size 12 --epochs 5 --trainer sft
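To make "the checkpoints are missing" easier to verify, here is a minimal sketch that lists what the project folder actually contains. It is an illustration, not part of autotrain: the function name `summarize_project` is hypothetical, the `checkpoint-*` directory pattern is the naming commonly used by the Hugging Face Trainer, and the adapter file names are the ones PEFT conventionally writes. None of these names are confirmed by the report above, so adjust them to match the files you see.

```python
from pathlib import Path

# Assumed file names: the ones PEFT conventionally writes for a LoRA adapter.
ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors", "adapter_model.bin")


def summarize_project(project_dir):
    """Report checkpoint directories and adapter files found in project_dir."""
    root = Path(project_dir)
    # Assumed pattern: intermediate checkpoints are stored as "checkpoint-<step>" directories.
    checkpoints = sorted(p.name for p in root.glob("checkpoint-*") if p.is_dir())
    # Adapter files sit directly in the project folder when training used --peft.
    adapters = [name for name in ADAPTER_FILES if (root / name).is_file()]
    return {"checkpoints": checkpoints, "adapter_files": adapters}


if __name__ == "__main__":
    # "ftLlama" is the --project-name used in the command above.
    print(summarize_project("ftLlama"))
```

If `adapter_files` is non-empty while `checkpoints` is empty, the fine-tuned LoRA weights may still be present as the final adapter even though no intermediate checkpoint directories remain.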

Mobinapournemat, Sep 30 '24 14:09