
Restart training from a checkpoint, with steps, etc

rlenain opened this issue 1 year ago • 12 comments

Hello,

I was wondering whether there is an easy way to restart training from a checkpoint, resuming the steps, epochs, optimizer state, etc. This is for the case where training dies and we want to restart it from where it left off rather than from epoch 0.

Thanks

rlenain avatar Aug 13 '24 12:08 rlenain

specify --checkout, but the step/epoch will start from 0

aluminumbox avatar Aug 16 '24 02:08 aluminumbox

Do you mean --checkpoint? I've tried that, but it doesn't work. I think the LR scheduler being in a different state makes the training loss go kind of crazy, and in the end it doesn't actually return to the same place.

rlenain avatar Aug 16 '24 09:08 rlenain

Hello, have you solved this problem? I also specified --checkpoint, and the training learning rate also started over from 0.

CriDora avatar Aug 26 '24 12:08 CriDora

I think the author will update the code for this issue; in the meantime, here is a temporary solution you can refer to.

  1. When starting a new training session, point --checkpoint at the last saved checkpoint in run.sh. Example: --checkpoint CosyVoice/examples/libritts/cosyvoice/exp/cosyvoice/llm/torch_ddp/epoch_2_whole.pt \

  2. Make a small edit in the file CosyVoice/cosyvoice/bin/train.py:

    # Save init checkpoints
    info_dict = deepcopy(configs['train_conf'])
    save_model(model, 'init', info_dict)

    current_epoch = info_dict['current_epoch'] # add
    current_step = info_dict['current_step']       # add

    # Get executor
    executor = Executor()

    # Start training loop
    for epoch in range(current_epoch, info_dict['max_epoch']): # change
        executor.epoch = epoch
        executor.step = current_step # add
        train_dataset.set_epoch(epoch)
        dist.barrier()
        group_join = dist.new_group(backend="gloo", timeout=datetime.timedelta(seconds=args.timeout))
        executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, group_join)
        dist.destroy_process_group(group_join)

  3. Add the config below to the file CosyVoice/examples/libritts/cosyvoice/conf/cosyvoice.yaml

Note: current_epoch and current_step must match the checkpoint you set in the run.sh file.

Example:

train_conf:
    optim: adam
    optim_conf:
        lr: 0.001 # change to 1e-5 during sft
    scheduler: warmuplr # change to constantlr during sft
    scheduler_conf:
        warmup_steps: 2500
    max_epoch: 200
    grad_clip: 5
    accum_grad: 2
    log_interval: 100
    save_per_step: 1500
    current_epoch: 2    # add
    current_step: 1311 # add

drlor2k avatar Aug 29 '24 23:08 drlor2k
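
(A small editorial aside, not part of the post above: reading the two new keys with a default keeps the same loop working for fresh runs whose YAML does not define them yet.)

    current_epoch = info_dict.get('current_epoch', 0)  # 0 => behaves like a fresh run if the key is absent
    current_step = info_dict.get('current_step', 0)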

Thank you, after modifying the code you provided, the checkpoint can be loaded normally.

CriDora avatar Aug 31 '24 04:08 CriDora

I realized the code above would reset the step back to current_step at the start of every new epoch, so I modified it a bit.

    # Save init checkpoints
    info_dict = deepcopy(configs['train_conf'])
    save_model(model, 'init', info_dict)

    # ADD
    current_epoch = info_dict['current_epoch']
    current_step  = info_dict['current_step']
    start_session = True

    # Get executor
    executor = Executor()

    # Start training loop
    for epoch in range(current_epoch, info_dict['max_epoch']): # change
        executor.epoch = epoch
        if start_session:
            executor.step = current_step  # add
        train_dataset.set_epoch(epoch)
        dist.barrier()
        group_join = dist.new_group(backend="gloo", timeout=datetime.timedelta(seconds=args.timeout))
        executor.train_one_epoc(model, optimizer, scheduler, train_data_loader, cv_data_loader, writer, info_dict, group_join)
        dist.destroy_process_group(group_join)
        start_session = False

drlor2k avatar Sep 01 '24 01:09 drlor2k

Thanks, you are right. Do I need to modify the current_epoch and current_step values in cosyvoice.fromscratch.yaml every time I resume training from a checkpoint?

CriDora avatar Sep 02 '24 09:09 CriDora

Yes, you need to modify the current_epoch and current_step values in cosyvoice.fromscratch.yaml every time you resume training from a checkpoint.

You can see current_epoch and current_step in the checkpoint filename. Example: epoch_1_step_33000.pt

drlor2k avatar Sep 02 '24 12:09 drlor2k
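
(Editorial aside, not part of the patch above: since the filename already encodes both values, a small helper could parse them instead of copying them into the YAML by hand. Sketch only; parse_epoch_step is a hypothetical name.)

    import re

    def parse_epoch_step(ckpt_path):
        """Extract (epoch, step) from a checkpoint name like epoch_1_step_33000.pt."""
        m = re.search(r'epoch_(\d+)_step_(\d+)', ckpt_path)
        if m is None:
            return 0, 0  # fall back to a fresh run
        return int(m.group(1)), int(m.group(2))

    # parse_epoch_step('epoch_1_step_33000.pt') -> (1, 33000)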

Hello, sorry to bother you again. I want to confirm whether the learning rate resumes its normal decay after restoring a checkpoint with the code above, because after resuming with your code the restored step count is correct, but the learning rate still starts warming up again from the beginning.

CriDora avatar Sep 15 '24 14:09 CriDora

Is there any update? I noticed that in the warmup scheduler code there is a line like step = self.last_epoch.

JohnHerry avatar May 20 '25 01:05 JohnHerry
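
(Editorial sketch building on the observation above, assuming the scheduler is a wenet-style WarmupLR that tracks its position in last_epoch: it can be fast-forwarded after the checkpoint is loaded so the warmup does not restart from zero. The helper name is hypothetical, and whether the scheduler exposes set_step depends on the version in use.)

    def fast_forward_scheduler(scheduler, current_step):
        """Move the LR scheduler to current_step so warmup does not restart."""
        if hasattr(scheduler, 'set_step'):       # wenet's WarmupLR provides this helper
            scheduler.set_step(current_step)
        else:
            scheduler.last_epoch = current_step  # generic _LRScheduler fallback
        # The next scheduler.step() call in the training loop then continues the
        # schedule from current_step instead of warming up again from zero.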

It seems that the CosyVoice training pipeline does not support resuming training. You can specify a pretrained model with --pretrain, but the training arguments, the warmup steps, and the initial learning rate will be reset again, and the values kept in your pretrained checkpoint will not be carried over.

JohnHerry avatar May 20 '25 01:05 JohnHerry

It does not save the optimizer parameters yet, either.

JohnHerry avatar May 20 '25 02:05 JohnHerry
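
(Editorial addition for reference: a full resume would also need the optimizer and scheduler state saved alongside the model weights. Below is a minimal generic PyTorch sketch of what that could look like; these helpers are hypothetical and are not CosyVoice's save_model.)

    import torch

    def save_full_checkpoint(path, model, optimizer, scheduler, epoch, step):
        # Bundle everything a true resume needs into one file.
        torch.save({
            'model': model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'scheduler': scheduler.state_dict(),
            'epoch': epoch,
            'step': step,
        }, path)

    def load_full_checkpoint(path, model, optimizer, scheduler):
        # Restore all of it before re-entering the training loop.
        state = torch.load(path, map_location='cpu')
        model.load_state_dict(state['model'])
        optimizer.load_state_dict(state['optimizer'])
        scheduler.load_state_dict(state['scheduler'])
        return state['epoch'], state['step']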