autotrain-advanced
[BUG] 'autotrain llm' does not save checkpoints in the project folder
Prerequisites
- [X] I have read the documentation.
- [X] I have checked other issues for similar problems.
Backend
Local
Interface Used
UI
CLI Command
autotrain llm --train --project-name ftLlama --model meta-llama/Llama-2-7b-hf --data-path . --peft --lr 2e-4 --batch-size 12 --epochs 5 --trainer sft
UI Screenshots & Parameters
No response
Error Logs
Checkpoints are not saved
Additional Information
I'm fine-tuning "meta-llama/Llama-2-7b-hf" on a remote GPU using autotrain llm, but after training completes, no checkpoints are saved. The project folder contains files like autotrain-data, adapter-config, tokenizer, training-params, etc., but no checkpoints.
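To make the report easier to verify, here is a minimal sketch that lists what is actually in the project folder. It rests on assumptions that are not stated in this issue: that intermediate checkpoints, if any, would be directories named `checkpoint-<step>`, and that LoRA adapter weights would be saved as `adapter_model.safetensors` or `adapter_model.bin`. The helper name `summarize_project_dir` is hypothetical.

```python
from pathlib import Path


def summarize_project_dir(project_dir):
    """Return (checkpoint_dirs, adapter_files) found directly under project_dir."""
    root = Path(project_dir)
    # Assumption: intermediate checkpoints are directories named checkpoint-<step>.
    checkpoints = sorted(p.name for p in root.glob("checkpoint-*") if p.is_dir())
    # Assumption: adapter weights are files named adapter_model.<extension>.
    adapters = sorted(p.name for p in root.glob("adapter_model.*") if p.is_file())
    return checkpoints, adapters


if __name__ == "__main__":
    # "ftLlama" is the --project-name used in this issue.
    ckpts, adapters = summarize_project_dir("ftLlama")
    print("checkpoint dirs:", ckpts or "none")
    print("adapter weights:", adapters or "none")
```

If adapter weights show up but no checkpoint directories do, the run may have saved only the final adapter rather than nothing at all, which would be worth stating in the report.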
Here's the content of the training-params file:
{
  "model": "meta-llama/Llama-2-7b-hf",
  "project_name": "ftLlama",
  "data_path": "ftLlama/autotrain-data",
  "train_split": "train",
  "valid_split": null,
  "add_eos_token": true,
  "block_size": 1024,
  "model_max_length": 2048,
  "padding": "right",
  "trainer": "sft",
  "use_flash_attention_2": false,
  "log": "none",
  "disable_gradient_checkpointing": false,
  "logging_steps": -1,
  "eval_strategy": "epoch",
  "save_total_limit": 1,
  "auto_find_batch_size": false,
  "mixed_precision": null,
  "lr": 0.0002,
  "epochs": 5,
  "batch_size": 12,
  "warmup_ratio": 0.1,
  "gradient_accumulation": 4,
  "optimizer": "adamw_torch",
  "scheduler": "linear",
  "weight_decay": 0.0,
  "max_grad_norm": 1.0,
  "seed": 42,
  "chat_template": null,
  "quantization": "int4",
  "target_modules": "all-linear",
  "merge_adapter": false,
  "peft": true,
  "lora_r": 16,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "model_ref": null,
  "dpo_beta": 0.1,
  "max_prompt_length": 128,
  "max_completion_length": null,
  "prompt_text_column": "autotrain_prompt",
  "text_column": "autotrain_text",
  "rejected_text_column": "autotrain_rejected_text",
  "push_to_hub": false,
  "unsloth": false,
  "distributed_backend": null
}
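Most of these parameters are unrelated to saving. As a rough sketch, the fields that seem most relevant to what ends up on disk can be pulled out as below. Which keys matter is my assumption, not something confirmed by the AutoTrain documentation, and `save_related_settings` is a hypothetical helper name.

```python
import json

# Assumption: these are the keys most likely to influence what gets saved.
SAVE_RELATED_KEYS = (
    "peft",
    "merge_adapter",
    "save_total_limit",
    "eval_strategy",
    "valid_split",
    "push_to_hub",
)


def save_related_settings(params_text):
    """Parse the training-params JSON text and keep only save-related keys."""
    params = json.loads(params_text)
    # Keys missing from the file are reported as None.
    return {key: params.get(key) for key in SAVE_RELATED_KEYS}


if __name__ == "__main__":
    # Excerpt of the values posted in this issue.
    excerpt = (
        '{"peft": true, "merge_adapter": false, "save_total_limit": 1, '
        '"eval_strategy": "epoch", "valid_split": null, "push_to_hub": false}'
    )
    for key, value in save_related_settings(excerpt).items():
        print(f"{key}: {value}")
```

With `peft` set to true and `merge_adapter` set to false, I would expect only adapter files rather than full model weights in the project folder, but that is a guess on my part.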
Am I missing something or is this a bug?
This is the command I'm using:
autotrain llm --train --project-name ftLlama --model meta-llama/Llama-2-7b-hf --data-path . --peft --lr 2e-4 --batch-size 12 --epochs 5 --trainer sft