
accelerate with wandb tracker not logging

Open GeethaGopinath opened this issue 2 years ago • 1 comment

System Info

accelerate version: 0.19.0
numpy version: 1.23.5
torch version: 1.12.1+cu102
python version: 3.8.16
OS: Ubuntu

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

During the training of my models, I would like to log both step-level and epoch-level loss and metrics. However, only the first logging call shows up; the other logging calls don't seem to work. Please find the code below:

```python
for epoch in range(starting_epoch, num_epochs):
    model.train()
    total_loss = 0

    for step, batch in enumerate(train_dataloader):
        if resume_from_checkpoint and epoch == starting_epoch:
            if resume_step is not None and step < resume_step:
                progress_bar.update(1)
                completed_steps += 1
                continue
        
        with accelerator.accumulate(model):
            outputs = model(**batch)
            loss = outputs.loss
            # We keep track of the loss at each epoch
            total_loss += loss.detach().float()
            accelerator.backward(loss)
        
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()
            
        if accelerator.sync_gradients:
            progress_bar.update(1)
            completed_steps += 1
        
        
        
        accelerator.log({"batch_train_loss": loss},step=completed_steps)

        if isinstance(checkpointing_steps, int):
            output_dir_step = f"step_{completed_steps}"
            if completed_steps % checkpointing_steps == 0:
                if output_dir is not None:
                    output_dir_step = os.path.join(output_dir, output_dir_step)
                    if not os.path.exists(output_dir_step):
                        os.makedirs(output_dir_step)
                accelerator.save_state(output_dir=output_dir_step)

    model.eval()
    losses = []
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            outputs = model(**batch)
        loss = outputs.loss
        completed_eval_steps += 1
        losses.append(accelerator.gather(loss.repeat(batch_size)))
        
        accelerator.log({"batch_eval_loss": loss},step=completed_eval_steps)
    
        

    losses = torch.cat(losses)
    try:
        eval_loss = torch.mean(losses)
        perplexity = math.exp(eval_loss)
    except OverflowError:
        perplexity = float("inf")

    accelerator.print(f">>> Epoch {epoch}: Perplexity: {perplexity}")
    
    accelerator.log(
            {
                "perplexity": perplexity,
                "eval_loss": eval_loss,
                "train_loss": total_loss.item() / len(train_dataloader),
                "epoch": epoch,
                "step": completed_steps,
            },
            step=completed_steps,
        )

    if checkpointing_steps == "epoch":
        output_dir_epoch = f"epoch_{epoch}"
        if output_dir is not None:
            output_dir_epoch = os.path.join(output_dir, output_dir_epoch)
            if not os.path.exists(output_dir_epoch):
                os.makedirs(output_dir_epoch)
        accelerator.save_state(output_dir=output_dir_epoch)
        
    

accelerator.end_training()
```

[image: wandb output after training completes]

The image above shows all that gets logged after training completes. Kindly help with resolving this issue.

Best

Expected behavior

I would like to visualise train and eval loss at both the step and epoch level.

GeethaGopinath avatar May 12 '23 07:05 GeethaGopinath

This is probably because of the passed completed_eval_steps. Try to see what happens without including the step parameter when you log.

muellerzr avatar May 26 '23 17:05 muellerzr
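
As a minimal sketch of that suggestion (reusing the variable names from the snippet above, so treat them as assumptions): drop the explicit step argument from the eval log and record the eval step as an ordinary value instead, so wandb's internal step counter is never asked to move backwards.

```python
# Inside the eval loop: no explicit `step` argument, so wandb keeps
# using its own monotonically increasing step counter.
accelerator.log({
    "batch_eval_loss": loss.item(),     # log a Python scalar rather than a tensor
    "eval_step": completed_eval_steps,  # keep the step information as a plain metric
})
```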

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 20 '23 15:06 github-actions[bot]

I get a similar error: some metrics get logged and some don't. I am also using the accelerator with wandb.

MovingKyu avatar Aug 08 '23 22:08 MovingKyu

I think accelerator has some issues with passing multiple dataloaders to accelerator.prepare.

MovingKyu avatar Aug 08 '23 23:08 MovingKyu

@MovingKyu can you open a new issue on this with a full reproducer, and answer the prompts given when opening a bug report? Thanks!

muellerzr avatar Aug 08 '23 23:08 muellerzr

@muellerzr Before I open a bug report: I find your suggestion of removing the step parameter actually works, and it logs all the numbers. However, the step information is crucial in my case. Do you know how to log the metrics while still passing the step parameter?

MovingKyu avatar Aug 09 '23 05:08 MovingKyu
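
One way to keep an explicit step (a sketch, not an official Accelerate recipe): wandb ignores rows whose step is lower than the last step it has already seen, so the train counter completed_steps and the separate eval counter completed_eval_steps in the code above end up fighting each other. A single counter that only ever increases, shared by every accelerator.log call, avoids that.

```python
global_step = 0  # one counter shared by train and eval logging

for epoch in range(starting_epoch, num_epochs):
    model.train()
    for step, batch in enumerate(train_dataloader):
        ...  # forward/backward as in the snippet above
        if accelerator.sync_gradients:
            global_step += 1
        accelerator.log({"batch_train_loss": loss.item()}, step=global_step)

    model.eval()
    for step, batch in enumerate(eval_dataloader):
        ...  # eval forward pass as in the snippet above
        global_step += 1  # keep counting upwards instead of restarting at 0
        accelerator.log({"batch_eval_loss": loss.item()}, step=global_step)
```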

Me too... Has this problem been resolved?

wanghan0501 avatar Jan 05 '24 05:01 wanghan0501

You need to call .log() one more time afterwards; it's a quirk of how wandb syncs.

muellerzr avatar Jan 05 '24 15:01 muellerzr

@muellerzr can you kindly elaborate on how to do that?

lzy37ld avatar Jun 01 '24 03:06 lzy37ld

@lzy37ld just do accelerator.log() (You may not need to pass anything in)

muellerzr avatar Jun 06 '24 13:06 muellerzr
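
A sketch of what that final call might look like (assumption: accelerator.log takes the values dict as a required argument, so an empty dict is passed rather than nothing): one extra log after the last real logging call nudges wandb to commit whatever it is still buffering for the final step, before the trackers are shut down.

```python
# ... last "real" logging call of the run happens above ...

# One extra log so wandb commits (flushes) the data buffered for the
# final step, then close the trackers.
accelerator.log({})
accelerator.end_training()
```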