simpleT5

How to resume training?

Open RK-BAKU opened this issue 2 years ago • 3 comments

Hi guys! Is it possible to continue training from a specific checkpoint?

RK-BAKU, Oct 25 '22 09:10

This is important! Any help on this one?

mgh1, Feb 20 '23 12:02

@RK-BAKU @mgh1

Hi, what works for me is to load the model I saved and then keep training it:

model.load_model("t5", 'file/to/your/trained/model', use_gpu=True)

# The rest is the same as for regular training

MAX_EPOCHS = 3

torch.cuda.memory_summary(device=None, abbreviated=False)
torch.utils.checkpoint

model.train(train_df=df[0:int(0.7 * TRAINNING_SIZE)],
            eval_df=df[int(0.7 * TRAINNING_SIZE):TRAINNING_SIZE],
            source_max_token_len=MAX_LEN, target_max_token_len=SUMMARY_LEN,
            batch_size=5, max_epochs=MAX_EPOCHS,
            outputdir='/content/gdrive/MyDrive/HW5_HL_gen/t5model', use_gpu=True)
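The load-then-train approach can be wrapped in a small helper. This is only a minimal sketch under a few assumptions: that `train()` writes one folder per epoch into `outputdir` with a name containing `epoch-<n>` (for example `simplet5-epoch-2-train-loss-0.9526-val-loss-1.2345`), and that `load_model` restores the weights and tokenizer but not the optimizer state or epoch counter. The names `latest_checkpoint`, `resume_training`, `checkpoint_root`, and `extra_epochs` are hypothetical and not part of simpleT5.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoint folder names contain "epoch-<n>".
EPOCH_PATTERN = re.compile(r"epoch-(\d+)")


def latest_checkpoint(checkpoint_root: str) -> Optional[Path]:
    """Return the checkpoint folder with the highest epoch number, or None."""
    root = Path(checkpoint_root)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = EPOCH_PATTERN.search(child.name.lower())
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


def resume_training(train_df, eval_df, checkpoint_root, outputdir,
                    extra_epochs=3, base_model="t5-base", **train_kwargs):
    """Load the newest checkpoint (if any) and train for more epochs."""
    # Imported lazily so the helper above works without simpleT5 installed.
    from simplet5 import SimpleT5

    model = SimpleT5()
    checkpoint = latest_checkpoint(checkpoint_root)
    if checkpoint is None:
        # No earlier run found: start from the pretrained base model.
        model.from_pretrained(model_type="t5", model_name=base_model)
    else:
        model.load_model("t5", str(checkpoint), use_gpu=True)
    model.train(train_df=train_df, eval_df=eval_df, max_epochs=extra_epochs,
                outputdir=outputdir, use_gpu=True, **train_kwargs)
    return model
```

If epoch numbering does restart at 0 after loading, writing the resumed run to a different `outputdir` than `checkpoint_root` keeps the two sets of folders from being confused.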

CherylChaoNYCU, May 21 '23 13:05

How do you save the model?

There doesn't seem to be any way to save the model.

kolaganisankar, Mar 30 '24 17:03