transformers
push_to_hub doesn't push checkpoint folder while training
System Info
I am using Google Colab with Unsloth Mistral notebook.
Reproduction
I am using this snippet
import torch
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        report_to = "wandb",
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 250,
        learning_rate = 2e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "drive/MyDrive/Unsloth/mistral/outputs-mistral-sharegpt90k",
        num_train_epochs = 1,
        # Save a checkpoint every 10 steps and push every checkpoint folder to the Hub.
        save_strategy = "steps",
        save_steps = 10,
        push_to_hub = True,
        hub_model_id = "mistral-sharegpt90k",
        hub_token = hftoken,
        hub_strategy = "all_checkpoints",
        hub_private_repo = True,
    ),
)
Everything works, but push_to_hub doesn't seem to push the checkpoint folders to the model repository. Other than that, the model was pushed successfully.
Expected behavior
It should push each checkpoint folder to the Hub and keep the repository in sync with output_dir.
cc @younesbelkada
Hmm, that shouldn't be the case. For now, can you call trainer.model.push_to_hub()? Does the issue persist with non-Unsloth models?
@younesbelkada Do you mean calling trainer.model.push_to_hub() after training? Please provide an example.
I am trying to push the checkpoint folders while training is in progress.
I can try with another model. Will keep you posted.
Sorry, I thought you meant after training; in that case your snippet looks correct. Yes, please let us know as soon as you make some progress so that we can work out whether the bug is in transformers or Unsloth.
Try running the following in a cell before training to authenticate with Hugging Face:
from huggingface_hub import login
login()
@haydenbspence That shouldn't be the cause, though. I have already set hub_token=hftoken; hftoken is fetched from a Google Colab secret, and it works because I can push the model. It's only the checkpoint folders that are not pushed during training.
@younesbelkada I tried with teknium/OpenHermes-2.5-Mistral-7B instead of the Unsloth model. It still doesn't push the checkpoint-10 or checkpoint-20 folder (I set save_steps = 10 in this case) while training.
Here is the notebook you can try.
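One way to confirm the symptom is to compare the checkpoint folders the run should have written with the ones actually present in the Hub repository. The sketch below is only an illustration: the helper names are hypothetical, it assumes one checkpoint-N folder per multiple of save_steps, and fetch_repo_files relies on list_repo_files from huggingface_hub being installed and the token having read access to the repo.

```python
import re

# Matches paths such as "checkpoint-10/adapter_model.safetensors".
CHECKPOINT_RE = re.compile(r"^(checkpoint-\d+)/")


def expected_checkpoints(max_steps, save_steps):
    """Folder names expected locally, assuming one save per multiple of save_steps."""
    return [f"checkpoint-{s}" for s in range(save_steps, max_steps + 1, save_steps)]


def pushed_checkpoints(repo_files):
    """Extract the distinct checkpoint folders from a flat list of repo file paths."""
    found = {m.group(1) for f in repo_files if (m := CHECKPOINT_RE.match(f))}
    return sorted(found, key=lambda name: int(name.split("-")[1]))


def missing_checkpoints(expected, repo_files):
    """Checkpoint folders that were expected but are absent from the repo listing."""
    present = set(pushed_checkpoints(repo_files))
    return [name for name in expected if name not in present]


def fetch_repo_files(repo_id, token):
    """Network call: list every file path in a Hub model repo."""
    from huggingface_hub import list_repo_files  # imported lazily; needs network access
    return list_repo_files(repo_id, token=token)
```

For the run above, something like missing_checkpoints(expected_checkpoints(250, 10), fetch_repo_files(repo_id, hftoken)) would list the folders that never reached the Hub, where repo_id is the full "user/mistral-sharegpt90k" style identifier of the repository.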
Any updates? Why is it happening?
cc @SunMarc I don't know if you're the best person to pick this up after @younesbelkada ?
I'll have a look @amyeroberts !
Hi @pacozaa and @kdcyberdude, could you try passing hub_always_push = True in TrainingArguments? It is strange that no checkpoints were uploaded at all. When a push is already in progress, we skip the push for the new checkpoint unless hub_always_push is True. LMK if this solves your issue!
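The skip behaviour described in this comment can be pictured with a small toy model. This is not the Trainer's actual code: FakeUploadJob and ToyPusher are hypothetical names, and the logic only mirrors the rule as stated here (skip a new push while an earlier one is still running, unless always_push is set).

```python
class FakeUploadJob:
    """Stand-in for a background upload; it is 'done' once finish() is called."""

    def __init__(self):
        self._done = False

    def is_done(self):
        return self._done

    def finish(self):
        self._done = True


class ToyPusher:
    """Toy model of push-on-save with an optional 'always push' override."""

    def __init__(self, always_push):
        self.always_push = always_push
        self.in_progress = None  # most recently started upload, if any
        self.pushed = []         # steps for which an upload was started
        self.skipped = []        # steps whose upload was skipped

    def on_save(self, step):
        busy = self.in_progress is not None and not self.in_progress.is_done()
        if busy and not self.always_push:
            # An earlier upload is still running: drop this checkpoint's push.
            self.skipped.append(step)
            return None
        job = FakeUploadJob()
        self.in_progress = job
        self.pushed.append(step)
        return job


# With always_push=False, a save that arrives mid-upload is skipped.
pusher = ToyPusher(always_push=False)
first = pusher.on_save(10)
pusher.on_save(20)   # skipped: the step-10 upload has not finished
first.finish()
pusher.on_save(30)   # pushed: nothing is in progress any more
print(pusher.pushed, pusher.skipped)
```

Under this model, a small save_steps combined with slow uploads (large checkpoints, a Colab connection) would cause many intermediate checkpoints to be skipped, which is why hub_always_push is worth trying.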
Hello! I'm also facing this issue with auto pushing from the SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        # max_steps=60,
        num_train_epochs=2,
        learning_rate=2e-4,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir=OUTPUT_DIR,
        report_to="wandb",
        save_strategy="epoch",
        push_to_hub=True,
        hub_strategy="all_checkpoints",
        hub_token=HUGGINGFACE_TOKEN,
        hub_private_repo=True,
        hub_model_id=f"{HUGGINGFACE_REPO}-lora-autosave",  # HUGGINGFACE_REPO = user/base-model-chat-finetuned
        hub_always_push=True,
    ),
)
Pushing at the end works, though, with:
lora_repo = f"{HUGGINGFACE_REPO}-lora"
print(f"Pushing lora to {lora_repo}.")
model.push_to_hub(
    lora_repo,
    token=HUGGINGFACE_TOKEN,
    private=True,
)
tokenizer.push_to_hub(
    lora_repo,
    token=HUGGINGFACE_TOKEN,
    private=True,
)
But on my end it actually raises the following:
huggingface_hub.utils._errors.HfHubHTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/api/repos/create (Request ID: Root=1-blabla Invalid username or password.
This is weird because the regular push_to_hub works properly.
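When a 401 appears on repo creation while manual pushes succeed, one thing that can be ruled out quickly is a mismatch between the account the token belongs to and the namespace in hub_model_id. The sketch below is a hedged diagnostic, not a confirmed cause of the error above: token_owner, repo_namespace and namespace_matches are hypothetical helper names, and the only library call used is whoami from huggingface_hub, whose result is assumed to carry the account name under the "name" key. A mismatch is not necessarily wrong, since the namespace may be an organization the user belongs to.

```python
def token_owner(token, whoami_fn=None):
    """Return the account name the token resolves to (network call by default)."""
    if whoami_fn is None:
        from huggingface_hub import whoami  # imported lazily; needs network access
        whoami_fn = whoami
    info = whoami_fn(token=token)
    return info.get("name")


def repo_namespace(hub_model_id):
    """Return the 'user' part of 'user/repo', or None if no namespace is given."""
    if "/" in hub_model_id:
        return hub_model_id.split("/", 1)[0]
    return None


def namespace_matches(hub_model_id, owner):
    """True if the repo has no explicit namespace or the namespace equals the owner."""
    namespace = repo_namespace(hub_model_id)
    return namespace is None or namespace == owner
```

For example, namespace_matches(f"{HUGGINGFACE_REPO}-lora-autosave", token_owner(HUGGINGFACE_TOKEN)) returning False would suggest checking whether the token is allowed to create repos under that namespace.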
Have you solved it?
@SunMarc I think @psyb0t answered the question. Anything else you'd like me to try?
I believe this is solved: when I ran my script this time, it worked.