Training instability and non-convergence with TAPIR on kubric-e dataset across multiple GPUs
I'm running into problems training the TAPIR model on the kubric-e dataset across different GPU configurations with PyTorch Lightning (both from scratch and from a pretrained checkpoint). The losses either don't converge or are highly unstable, even with gradient clipping (sketched below).
Dataset: kubric-e (I tried training on the validation split)
Observed issues
Loss instability:
- When training from scratch: losses fluctuate between 5 and 100+
- With the checkpointed model: losses fluctuate between 0.5 and 10+
GPU/batch size configurations tested:
- 2 GPUs with batch size 2 (2 x 24)
- 1 GPU with batch size 4 (1 x 48)
- 1 GPU with batch size 2 (1 x 48)
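For reference, a minimal sketch of how I understand gradient clipping can be wired through the Lightning Trainer (the Trainer call in train_tapir.py below does not show it, and the clip value here is only a placeholder, not necessarily the exact setting used):

from pytorch_lightning import Trainer

# Assumed gradient-clipping setup; 1.0 and 'norm' are placeholder choices.
trainer = Trainer(
    accelerator='gpu',
    gradient_clip_val=1.0,           # clip gradients...
    gradient_clip_algorithm='norm',  # ...by their global norm
)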
train_tapir.py:
import os

import torch
from torch.optim import Adam
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer

# Project-local imports (module names inferred from the files referenced in this issue).
import tapir_model
from tap_vid_datasets import TapVidDataset, collate_fn

if __name__ == '__main__':
    checkpoint_state_dict = torch.load("/scratch/shared/beegfs/shivanim/tapir/tapir_checkpoint_panning.pt")
    model = tapir_model.TAPIR()
    model.load_state_dict(checkpoint_state_dict, strict=False)
    # model = model.to(torch.device('cuda'))  # not needed: Lightning moves the model to the GPU

    data_root = os.path.join("", "/scratch/shared/beegfs/shivanim/tapir/tapvid-kinetics/pickle_folder/")
    train_dataset = TapVidDataset(dataset_type="kinetics", data_root=data_root)
    train_loader = DataLoader(
        train_dataset,
        batch_size=1,
        shuffle=True,
        num_workers=0,
        pin_memory=True,
        collate_fn=collate_fn,
        drop_last=True,
    )

    trainer = Trainer(accelerator='gpu')
    trainer.fit(model, train_loader)

    # Note: this optimizer is created only after trainer.fit() returns and is never used;
    # Lightning builds its optimizer from configure_optimizers() on the module (see the sketch that follows).
    optimizer = Adam(model.parameters(), lr=1e-3)
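Since Lightning takes its optimizer from configure_optimizers() on the module rather than from the training script, the TAPIR LightningModule must define that hook somewhere; a minimal sketch of what I assume it looks like (the learning rate mirrors the unused Adam above):

import pytorch_lightning as pl
from torch.optim import Adam

class TAPIR(pl.LightningModule):
    # ... model definition and training_step as shown in tapir_model.py below ...

    def configure_optimizers(self):
        # Lightning calls this once to build the optimizer for trainer.fit().
        return Adam(self.parameters(), lr=1e-3)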
tapir_model.py:
def training_step(self, batch, batch_idx):
    # Skip batches that the collate_fn returned empty.
    if len(batch.video) == 0:
        return None

    frames = batch.video
    # Query points: one (frame index, spatial location) query per track.
    query_points = batch.query_points
    gt_occluded = batch.occluded.float()
    gt_target_points = batch.target_points
    shape = frames.shape

    outputs = self.forward(frames, query_points)
    if outputs is None:
        return None

    # Note: only the first entry of each output list is used here.
    tracks = outputs['tracks'][0]
    occlusions = outputs['occlusion'][0]
    expected_dist = outputs['expected_dist'][0]

    loss_huber, loss_occ, loss_prob = model_utils.tapnet_loss(
        tracks, occlusions, gt_target_points, gt_occluded, shape)

    visibles = self.postprocess_occlusions(occlusions, expected_dist).detach().cpu().numpy()
    tracks = tracks.detach().cpu().numpy()

    loss = loss_huber + loss_occ + loss_prob

    # Print the loss terms every 50 steps.
    if batch_idx % 50 == 0:
        print(f"Batch {batch_idx} - Huber Loss: {loss_huber:.4f}, Occlusion Loss: {loss_occ:.4f}, "
              f"Prob Loss: {loss_prob:.4f}, Total Loss: {loss:.4f}")

    return loss
    # return {"tracks": tracks, "visibles": visibles, "loss": loss_huber,
    #         "loss_occ": loss_occ, "loss_prob": loss_prob}
tap_vid_datasets.py (adapted from CoTracker): https://colab.research.google.com/drive/1VQdpphxTEjUE7F-zYdBv5Ta_2NVH-45F?usp=sharing
Hi, one quick thing to try would be to add the loss on all intermediate outputs as well. Since TAPIR is an iterative refinement method, the output of every refinement step is supervised to get close to the ground truth. Please see the paper for details. Thanks.
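For example, since training_step above only takes index [0] of outputs['tracks'] and outputs['occlusion'], one way to try this is to sum the loss over every refinement iteration. A rough, untested sketch, assuming the output lists hold one entry per refinement step (the paper may additionally weight the iterations):

# Supervise every refinement iteration instead of only outputs[...][0].
loss_huber = loss_occ = loss_prob = 0.0
for tracks_i, occ_i in zip(outputs['tracks'], outputs['occlusion']):
    l_h, l_o, l_p = model_utils.tapnet_loss(
        tracks_i, occ_i, gt_target_points, gt_occluded, shape)
    loss_huber = loss_huber + l_h
    loss_occ = loss_occ + l_o
    loss_prob = loss_prob + l_p
loss = loss_huber + loss_occ + loss_prob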