
hub_strategy + saves_per_epoch not pushing saves to hub

Open · andysalerno opened this issue 1 year ago · 2 comments

Please check that this issue hasn't been reported before.

  • [X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

With hub_strategy: "every_save" and saves_per_epoch: 2, I expect the model to be pushed to the hub twice per epoch: once at the halfway point and once at the end of the epoch.

Current behaviour

The model is indeed being saved locally twice per epoch, but it is not being pushed to the hub.

Steps to reproduce

configure:

hub_strategy: "every_save"
saves_per_epoch: 2

and start a training run.
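
For context, here is a rough sketch of the HF Trainer arguments these options appear to map to. This is an assumption about how axolotl wires things up, not taken from its source, and the save_steps value is only illustrative arithmetic from max_steps: 400, num_epochs: 2, saves_per_epoch: 2:

```python
from transformers import TrainingArguments

# Rough equivalent of the axolotl options in this report (assumption: axolotl
# turns saves_per_epoch into a step-based save schedule before building the Trainer).
args = TrainingArguments(
    output_dir="./lora-out-rainbow7",
    push_to_hub=True,                          # must be enabled for any hub upload
    hub_model_id="andysalerno/rainbowfish-v7",
    hub_strategy="every_save",                 # push on every checkpoint save
    save_strategy="steps",
    save_steps=100,                            # illustrative: 400 max_steps / 2 epochs / 2 saves per epoch
)
```

With hub_strategy="every_save", the Trainer is supposed to push each time a checkpoint is written, which is the behavior that is not happening here.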

Config yaml

base_model: andysalerno/mistral-sft-v3
model_type: AutoModelForCausalLM

load_in_8bit: true
load_in_4bit: false
strict: false

datasets:
  - path: andysalerno/rainbowfish-v1
    type:
      system_prompt: ""
      field_system: system
      field_instruction: input
      field_output: output
      format: "{instruction}"
      no_input_format: "{instruction}"

dataset_prepared_path: last_run_prepared
val_set_size: 0.005
output_dir: ./lora-out-rainbow7

adapter: lora
lora_model_dir:

sequence_len: 2048
sample_packing: false # was true
eval_sample_packing: false
pad_to_sequence_len: false
padding_side: left

lora_r: 64
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

lora_modules_to_save:
  - embed_tokens
  - lm_head

wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 4
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false

early_stopping_patience: 3

local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3

hub_strategy: "every_save"
hub_model_id: andysalerno/rainbowfish-v7

num_epochs: 2
warmup_steps: 100

warmup_ratio: 0.1

eval_steps: 200
eval_table_size:
eval_table_max_new_tokens: 128

save_steps: 5

max_steps: 400

saves_per_epoch: 2
debug:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<|im_start|>"
  eos_token: "<|im_end|>"
  unk_token: ""

Possible solution

No response

Which Operating Systems are you using?

  • [X] Linux
  • [ ] macOS
  • [ ] Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • [X] My issue title is concise, descriptive, and in title casing.
  • [X] I have searched the existing issues to make sure this bug has not been reported yet.
  • [X] I am using the latest version of axolotl.
  • [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

andysalerno · Feb 07 '24 22:02

Note: when I set save_steps: 5 and unset saves_per_epoch, it did indeed push to the hub after 5 steps. I used this as a practice run to validate that everything was working, then I unset save_steps, went back to saves_per_epoch: 2, and started the real run. At epoch 0.5 it saved locally, but did not push to the hub. Not sure what will happen at epoch 1.00, but I hope it pushes, or I'm out another $80 in runpod credits :(

andysalerno · Feb 07 '24 22:02
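
One way to confirm whether an intermediate push actually reached the hub is to list what the repo currently contains. This is a quick check with huggingface_hub, not part of the original report, and assumes the repo is accessible with your token:

```python
from huggingface_hub import HfApi

api = HfApi()
# If only the files from the initial commit are present after the half-epoch
# save, the intermediate push never happened.
print(api.list_repo_files("andysalerno/rainbowfish-v7"))
```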

Hey! I sometimes get this behavior too. In my experience it will always push the last epoch / final model, but it only sometimes pushes the intermediate checkpoints, despite being configured to always do so. Unfortunately, all we do is pass this config to the HF Trainer, so it might be an issue upstream?

NanoCode012 · Feb 17 '24 03:02
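
If it does turn out to be an upstream issue, one possible stopgap (a sketch only; neither axolotl nor transformers ships this for this purpose) is a custom TrainerCallback that uploads each checkpoint folder itself on every save:

```python
from transformers import TrainerCallback
from huggingface_hub import HfApi


class ForcePushCallback(TrainerCallback):
    """Upload every saved checkpoint folder to the hub, independent of hub_strategy."""

    def __init__(self, repo_id: str):
        self.repo_id = repo_id
        self.api = HfApi()

    def on_save(self, args, state, control, **kwargs):
        # The Trainer writes checkpoints to {output_dir}/checkpoint-{global_step}.
        ckpt_dir = f"{args.output_dir}/checkpoint-{state.global_step}"
        self.api.upload_folder(
            repo_id=self.repo_id,
            folder_path=ckpt_dir,
            path_in_repo=f"checkpoint-{state.global_step}",
        )
        return control
```

It would need to be registered via trainer.add_callback(ForcePushCallback("andysalerno/rainbowfish-v7")), which requires hooking into the Trainer that axolotl builds, so it is more of a debugging aid than a fix.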