push_to_hub doesn't push checkpoint folder while training

Open pacozaa opened this issue 10 months ago • 12 comments

System Info

I am using Google Colab with the Unsloth Mistral notebook.

Reproduction

I am using this snippet:

import torch
from trl import SFTTrainer
from transformers import TrainingArguments

# model, tokenizer, dataset, max_seq_length and hftoken are defined in earlier cells of the notebook.

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        report_to="wandb",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 250,
        learning_rate = 2e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "drive/MyDrive/Unsloth/mistral/outputs-mistral-sharegpt90k",
        num_train_epochs=1,
        save_strategy = "steps",
        save_steps = 10,
        push_to_hub=True,
        hub_model_id="mistral-sharegpt90k",
        hub_token=hftoken,
        hub_strategy="all_checkpoints",
        hub_private_repo=True
    ),
)
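
(For completeness, training is then launched later in the notebook, which is when the checkpoints in output_dir are written and expected to be mirrored to the Hub; a minimal sketch, assuming the rest of the notebook is unchanged:)

trainer.train()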

Everything works well, but push_to_hub doesn't seem to push the checkpoint folders to the model repository. Other than that, the model itself was pushed successfully.

Expected behavior

It should push the checkpoint folders and keep them in sync with output_dir.

pacozaa avatar Apr 09 '24 09:04 pacozaa

cc @younesbelkada

amyeroberts avatar Apr 09 '24 10:04 amyeroberts

Hmm, that shouldn't be the case. Can you call trainer.model.push_to_hub() for now? Does the issue persist with non-Unsloth models?

younesbelkada avatar Apr 10 '24 10:04 younesbelkada

@younesbelkada Do you mean calling trainer.model.push_to_hub() after training? Please provide an example.

I am trying to push the checkpoints folder while training in progress.

I can try with other model. Will keep you posted.

pacozaa avatar Apr 10 '24 12:04 pacozaa

Sorry, I thought you meant after training; in that case your snippet looks correct. Yes, please let us know as soon as you have made some progress so that we can work out whether the bug is in transformers or unsloth.
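
(As a reference for the manual push discussed above, a minimal sketch; the repo name mirrors hub_model_id from the original snippet, and trainer and hftoken are assumed to be the same objects defined there:)

# Push the trained model weights to the Hub manually after trainer.train().
# Assumptions: `trainer` is the SFTTrainer configured above, `hftoken` is the
# same Hugging Face token, and "mistral-sharegpt90k" mirrors hub_model_id.
trainer.model.push_to_hub(
    "mistral-sharegpt90k",
    token=hftoken,
    private=True,
)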

younesbelkada avatar Apr 10 '24 12:04 younesbelkada

Try executing a cell prior to training to authenticate with the Hugging Face Hub:

from huggingface_hub import login

login()

haydenbspence avatar Apr 10 '24 21:04 haydenbspence

@haydenbspence That shouldn't be the case, though. I have already added hub_token=hftoken, with hftoken being fetched from a secret in Google Colab, and it is working, because I can push the model, just not the checkpoint folders while training.
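
(For context, the token in the snippet is read from a Colab secret roughly like the sketch below; the secret name HF_TOKEN is an assumption:)

from google.colab import userdata

# Assumption: the Hugging Face token is stored as a Colab secret named "HF_TOKEN".
hftoken = userdata.get("HF_TOKEN")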

pacozaa avatar Apr 11 '24 02:04 pacozaa

@younesbelkada I tried with teknium/OpenHermes-2.5-Mistral-7B instead of the Unsloth model. It still doesn't push the checkpoint-10 or checkpoint-20 folders (I set save_steps = 10 in this case) while training.

Here is the notebook you can try.

pacozaa avatar Apr 11 '24 07:04 pacozaa

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 10 '24 08:05 github-actions[bot]

Any updates? Why is it happening?

kdcyberdude avatar Jun 27 '24 16:06 kdcyberdude

cc @SunMarc, I don't know if you're the best person to pick this up after @younesbelkada?

amyeroberts avatar Jun 27 '24 16:06 amyeroberts

I'll have a look, @amyeroberts!

SunMarc avatar Jun 28 '24 15:06 SunMarc

Hi @pacozaa and @kdcyberdude, could you try passing hub_always_push = True in TrainingArguments? It is strange that no checkpoints were uploaded at all. When a previous push is still in progress, we skip pushing the new checkpoint unless hub_always_push is True. LMK if this solves your issue!
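
(A minimal sketch of that suggestion applied to the original snippet; all other arguments stay as they were:)

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir = "drive/MyDrive/Unsloth/mistral/outputs-mistral-sharegpt90k",
    save_strategy = "steps",
    save_steps = 10,
    push_to_hub = True,
    hub_model_id = "mistral-sharegpt90k",
    hub_token = hftoken,        # defined earlier in the notebook
    hub_strategy = "all_checkpoints",
    hub_private_repo = True,
    hub_always_push = True,     # don't skip a checkpoint push while a previous push is still running
    # ... remaining arguments as in the original snippet
)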

SunMarc avatar Jun 28 '24 18:06 SunMarc

Hello! I'm also facing this issue with auto-pushing from the SFTTrainer:

from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# model, tokenizer, dataset, max_seq_length, OUTPUT_DIR, HUGGINGFACE_TOKEN and
# HUGGINGFACE_REPO are defined earlier in the script.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # max_steps=60,
        num_train_epochs=2,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir=OUTPUT_DIR,
        report_to="wandb",
        save_strategy="epoch",
        push_to_hub=True,
        hub_strategy="all_checkpoints",
        hub_token=HUGGINGFACE_TOKEN,
        hub_private_repo=True,
        hub_model_id=f"{HUGGINGFACE_REPO}-lora-autosave", # HUGGINGFACE_REPO = user/base-model-chat-finetuned
        hub_always_push=True,
    ),
)

Pushing at the end does work, though, with:

lora_repo = f"{HUGGINGFACE_REPO}-lora"

print(f"Pushing lora to {lora_repo}.")
model.push_to_hub(
    lora_repo,
    token=HUGGINGFACE_TOKEN,
    private=True,
)
tokenizer.push_to_hub(
    lora_repo,
    token=HUGGINGFACE_TOKEN,
    private=True,
)

But on my end the auto-push actually raises the following: huggingface_hub.utils._errors.HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-blabla) Invalid username or password. This is weird, because the regular push_to_hub works properly.
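
(One way to sanity-check, outside of the Trainer, whether the token passed as hub_token is valid is huggingface_hub.whoami; this is just a debugging sketch, assuming HUGGINGFACE_TOKEN is the same token as above:)

from huggingface_hub import whoami

# Assumption: HUGGINGFACE_TOKEN is the same token passed to hub_token above.
# An invalid token should raise a similar 401 error here as well.
print(whoami(token=HUGGINGFACE_TOKEN))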

psyb0t avatar Jun 30 '24 23:06 psyb0t

Have you solved it?

BogdanTurbal avatar Aug 07 '24 18:08 BogdanTurbal

@SunMarc I think @psyb0t answered the question. Anything else you'd like me to try?

pacozaa avatar Aug 20 '24 15:08 pacozaa

I believe this is solved; when I ran my script this time, it worked.

pacozaa avatar Sep 02 '24 09:09 pacozaa