hub_strategy + saves_per_epoch not pushing saves to hub
Please check that this issue hasn't been reported before.
- [X] I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
With hub_strategy: "every_save" and saves_per_epoch: 2, I expect the model to be pushed to the hub twice per epoch: once at the halfway point and once at the end.
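For context, this is how I read those two options (my assumption, not axolotl's actual code): roughly the following transformers.TrainingArguments, where the fractional save_steps is my guess at how saves_per_epoch gets expressed, and "every_save" should then push on every checkpoint save.

```python
# Sketch of the intended mapping only; the save_strategy/save_steps translation is an
# assumption, not axolotl's actual code. Recent transformers versions accept a
# fractional save_steps as a ratio of total training steps.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./lora-out-rainbow7",
    num_train_epochs=2,
    push_to_hub=True,
    hub_model_id="andysalerno/rainbowfish-v7",
    hub_strategy="every_save",  # push to the Hub on every checkpoint save
    save_strategy="steps",      # assumption: saves_per_epoch becomes a steps-based schedule
    save_steps=0.25,            # assumption: 0.25 of total steps ~= 2 saves/epoch over 2 epochs
)
```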
Current behaviour
The model is indeed being saved locally twice per epoch, but it is not being pushed to the hub.
Steps to reproduce
Configure hub_strategy: "every_save" and saves_per_epoch: 2, then start a training run.
Config yaml
base_model: andysalerno/mistral-sft-v3
model_type: AutoModelForCausalLM

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
  - path: andysalerno/rainbowfish-v1
    type:
      system_prompt: ""
      field_system: system
      field_instruction: input
      field_output: output
      format: "{instruction}"
      no_input_format: "{instruction}"
dataset_prepared_path: last_run_prepared
val_set_size: 0.005
output_dir: ./lora-out-rainbow7

adapter: lora
lora_model_dir:

sequence_len: 2048
sample_packing: false # was true
eval_sample_packing: false
pad_to_sequence_len: false
padding_side: left

lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj
lora_modules_to_save:
  - embed_tokens
  - lm_head

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 4
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

early_stopping_patience: 3

local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

hub_strategy: "every_save"
hub_model_id: andysalerno/rainbowfish-v7

num_epochs: 2
warmup_steps: 100
warmup_ratio: 0.1
eval_steps: 200
eval_table_size:
eval_table_max_new_tokens: 128
save_steps: 5
max_steps: 400
saves_per_epoch: 2
debug:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<|im_start|>"
  eos_token: "<|im_end|>"
  unk_token: "
Possible solution
No response
Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
Python Version
3.10
axolotl branch-commit
main
Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Note: when I set save_steps: 5 and unset saves_per_epoch, it did indeed push to the hub after 5 steps. I used this as a practice run to validate that everything was working, then unset save_steps, went back to saves_per_epoch: 2, and started the real run. At epoch 0.5 it saved locally but did not push to the hub. Not sure what will happen at epoch 1.0, but I hope it pushes, or I'm out another $80 in RunPod credits :(
Hey! I sometimes get this behavior. In my experience, it will always push the last epoch/final model. However, it only sometimes pushes the intermediate checkpoints, despite being configured to always do so. Unfortunately, all we do is pass this config to the HF Trainer, so it might be an issue upstream?
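If it is upstream, something like this minimal sketch (tiny placeholder model/dataset, placeholder Hub repo, and it assumes a transformers version that accepts fractional save_steps) could show whether the HF Trainer alone pushes intermediate checkpoints with hub_strategy="every_save":

```python
# Minimal, self-contained repro sketch against plain transformers (no axolotl),
# to check whether intermediate checkpoint saves get pushed with hub_strategy="every_save".
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "sshleifer/tiny-gpt2"  # tiny placeholder model, just to exercise the save/push path
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:200]")
ds = ds.filter(lambda x: len(x["text"].strip()) > 0)
ds = ds.map(lambda x: tokenizer(x["text"], truncation=True, max_length=64), batched=True)

args = TrainingArguments(
    output_dir="hub-push-repro",
    num_train_epochs=2,
    per_device_train_batch_size=2,
    save_strategy="steps",
    save_steps=0.25,                              # fractional, like saves_per_epoch: 2 would imply
    push_to_hub=True,
    hub_model_id="your-username/hub-push-repro",  # placeholder Hub repo
    hub_strategy="every_save",
    report_to="none",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # then check whether the intermediate checkpoints showed up on the Hub
```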