DDP Fixes & optimizations
Issue: after each epoch the dataloader was recreated. We know this because the albumentations 1.4 version warning shows up again and again.
I'm still hitting a metrics error, so this PR is a draft.
Help would be appreciated on this: I'm stuck with AttributeError: 'ClassifiedKeypoint' object has no attribute 'numel'
Possible fix: set full_state_update to True and override sync() in KeypointAPMetrics:
    # Override sync() to manually handle custom object synchronization
    def sync(
        self,
        dist_sync_fn: Optional[Callable] = None,
        process_group: Optional[Any] = None,
        should_sync: bool = True,
        distributed_available: Optional[Callable] = None,
    ) -> None:
        if not should_sync or dist_sync_fn is None:
            return
        # Only sync the total_ground_truth_keypoints tensor
        self.total_ground_truth_keypoints = dist_sync_fn(self.total_ground_truth_keypoints, process_group=process_group)
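One thing to watch for in the override above: if dist_sync_fn behaves like torchmetrics' gather_all_tensors, it returns one value per rank rather than a single reduced value, so the gathered result probably still needs a reduction before it is assigned back. Below is a minimal pure-Python sketch of that gather-then-reduce step. fake_gather, sync_count and the counts are hypothetical stand-ins for illustration only (no torch or torchmetrics involved), and summing is an assumption about how total_ground_truth_keypoints should be combined across ranks.

```python
from typing import Callable, List, Optional


def fake_gather(value: float) -> List[float]:
    """Hypothetical stand-in for a gather function: returns one value per rank."""
    # Pretend two other ranks counted 7 and 5 ground-truth keypoints.
    return [value, 7.0, 5.0]


def sync_count(
    local_count: float,
    dist_sync_fn: Optional[Callable[[float], List[float]]] = None,
    should_sync: bool = True,
) -> float:
    """Gather the per-rank counts, then reduce them with a sum."""
    if not should_sync or dist_sync_fn is None:
        # Nothing to synchronize (e.g. single-process run).
        return local_count
    gathered = dist_sync_fn(local_count)
    # The gather step yields a list, so reduce it explicitly.
    return sum(gathered)


print(sync_count(3.0, fake_gather))
```

In the real metric the same idea would apply to tensors (gather, then sum or concatenate), while the list of ClassifiedKeypoint objects would need its own handling.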
TODO: DDP optimization, look into the warning about unused args
I never tried parallel training with this codebase. Are you using the PyTorch Lightning Trainer for this?
I'm not sure how to add unit tests for parallel training, but that is something we should look into!
Well... I guess? I just used --device(s) 3