Training instability and non-convergence with TAPIR on kubric-e dataset across multiple GPUs
I'm running into problems training the TAPIR model on the kubric-e dataset across different GPU configurations with PyTorch Lightning (both from scratch and from a pretrained checkpoint). The losses either don't converge or are highly unstable, even with gradient clipping (sketched below).
Dataset: kubric-e (I tried training on the validation split)
Observed issues
Loss instability:
- When training from scratch: losses fluctuate between 5 and 100+
- With the checkpointed model: losses fluctuate between 0.5 and 10+
GPU/batch size configurations tested:
- 2 GPUs with batch size 2 (2 x 24)
- 1 GPU with batch size 4 (1 x 48)
- 1 GPU with batch size 2 (1 x 48)
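For reference, a minimal sketch of how I understand gradient clipping can be wired through the Lightning Trainer (the Trainer call in train_tapir.py below does not show it, and the clip value here is only a placeholder, not necessarily the exact setting used):

from pytorch_lightning import Trainer

# Assumed gradient-clipping setup; 1.0 and 'norm' are placeholder choices.
trainer = Trainer(
    accelerator='gpu',
    gradient_clip_val=1.0,           # clip gradients...
    gradient_clip_algorithm='norm',  # ...by their global norm
)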
train_tapir.py:
import os

import torch
from torch.optim import Adam
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer

# Project-local imports (module names inferred from the files referenced in this issue).
import tapir_model
from tap_vid_datasets import TapVidDataset, collate_fn

if __name__ == '__main__':
    checkpoint_state_dict = torch.load("/scratch/shared/beegfs/shivanim/tapir/tapir_checkpoint_panning.pt")
    model = tapir_model.TAPIR()
    model.load_state_dict(checkpoint_state_dict, strict=False)
    # model = model.to(torch.device('cuda'))  # not needed: Lightning moves the model to the GPU

    data_root = os.path.join("", "/scratch/shared/beegfs/shivanim/tapir/tapvid-kinetics/pickle_folder/")
    train_dataset = TapVidDataset(dataset_type="kinetics", data_root=data_root)
    train_loader = DataLoader(
        train_dataset,
        batch_size=1,
        shuffle=True,
        num_workers=0,
        pin_memory=True,
        collate_fn=collate_fn,
        drop_last=True,
    )

    trainer = Trainer(accelerator='gpu')
    trainer.fit(model, train_loader)

    # Note: this optimizer is created only after trainer.fit() returns and is never used;
    # Lightning builds its optimizer from configure_optimizers() on the module (see the sketch that follows).
    optimizer = Adam(model.parameters(), lr=1e-3)
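Since Lightning takes its optimizer from configure_optimizers() on the module rather than from the training script, the TAPIR LightningModule must define that hook somewhere; a minimal sketch of what I assume it looks like (the learning rate mirrors the unused Adam above):

import pytorch_lightning as pl
from torch.optim import Adam

class TAPIR(pl.LightningModule):
    # ... model definition and training_step as shown in tapir_model.py below ...

    def configure_optimizers(self):
        # Lightning calls this once to build the optimizer for trainer.fit().
        return Adam(self.parameters(), lr=1e-3)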
tapir_model.py:
def training_step(self, batch, batch_idx):
    # Skip batches that the collate_fn returned empty.
    if len(batch.video) == 0:
        return None

    frames = batch.video
    # Query points: one (frame index, spatial location) query per track.
    query_points = batch.query_points
    gt_occluded = batch.occluded.float()
    gt_target_points = batch.target_points
    shape = frames.shape

    outputs = self.forward(frames, query_points)
    if outputs is None:
        return None

    # Note: only the first entry of each output list is used here.
    tracks = outputs['tracks'][0]
    occlusions = outputs['occlusion'][0]
    expected_dist = outputs['expected_dist'][0]

    loss_huber, loss_occ, loss_prob = model_utils.tapnet_loss(
        tracks, occlusions, gt_target_points, gt_occluded, shape)

    visibles = self.postprocess_occlusions(occlusions, expected_dist).detach().cpu().numpy()
    tracks = tracks.detach().cpu().numpy()

    loss = loss_huber + loss_occ + loss_prob

    # Print the loss terms every 50 steps.
    if batch_idx % 50 == 0:
        print(f"Batch {batch_idx} - Huber Loss: {loss_huber:.4f}, Occlusion Loss: {loss_occ:.4f}, "
              f"Prob Loss: {loss_prob:.4f}, Total Loss: {loss:.4f}")

    return loss
    # return {"tracks": tracks, "visibles": visibles, "loss": loss_huber,
    #         "loss_occ": loss_occ, "loss_prob": loss_prob}
tap_vid_datasets.py (adapted from CoTracker): https://colab.research.google.com/drive/1VQdpphxTEjUE7F-zYdBv5Ta_2NVH-45F?usp=sharing
Hi, one quick thing to try would be to add the loss on all intermediate outputs as well. Since TAPIR is an iterative refinement method, the output of every refinement step is supervised to get close to the ground truth. Please see the paper for details. Thanks.
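For example, since training_step above only takes index [0] of outputs['tracks'] and outputs['occlusion'], one way to try this is to sum the loss over every refinement iteration. A rough, untested sketch, assuming the output lists hold one entry per refinement step (the paper may additionally weight the iterations):

# Supervise every refinement iteration instead of only outputs[...][0].
loss_huber = loss_occ = loss_prob = 0.0
for tracks_i, occ_i in zip(outputs['tracks'], outputs['occlusion']):
    l_h, l_o, l_p = model_utils.tapnet_loss(
        tracks_i, occ_i, gt_target_points, gt_occluded, shape)
    loss_huber = loss_huber + l_h
    loss_occ = loss_occ + l_o
    loss_prob = loss_prob + l_p
loss = loss_huber + loss_occ + loss_prob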