DDP Fixes & optimizations

Open ExtReMLapin opened this issue 7 months ago • 5 comments

Issue: the dataloader was recreated after each epoch. We know this because the albumentations 1.4 version warning shows up again every epoch.

I'm still getting a metrics error, so this PR is a draft.

ExtReMLapin avatar May 15 '25 15:05 ExtReMLapin

Help would be appreciated on this. I'm stuck on AttributeError: 'ClassifiedKeypoint' object has no attribute 'numel'

ExtReMLapin avatar May 15 '25 17:05 ExtReMLapin

Possible fix: set full_state_update to True and implement the following in KeypointAPMetrics:

    # Requires: from typing import Any, Callable, Optional
    # Override sync() to manually handle custom object synchronization.
    def sync(
        self,
        dist_sync_fn: Optional[Callable] = None,
        process_group: Optional[Any] = None,
        should_sync: bool = True,
        distributed_available: Optional[Callable] = None,
    ) -> None:
        # Skip syncing when it is disabled or no sync function was provided.
        if not should_sync or dist_sync_fn is None:
            return

        # Only sync the total_ground_truth_keypoints tensor; the custom
        # keypoint objects stay local to each process.
        self.total_ground_truth_keypoints = dist_sync_fn(
            self.total_ground_truth_keypoints, process_group=process_group
        )
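
The AttributeError above is probably raised because the metric state holds custom Python objects, while the distributed sync expects tensors. An alternative to overriding sync() might be to store each keypoint as a flat numeric row that can be wrapped in a tensor. This is only a rough sketch using plain Python: the ClassifiedKeypoint fields (u, v, probability, true_positive) and the encode/decode helpers are hypothetical and not taken from the codebase.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClassifiedKeypoint:
    """Hypothetical stand-in; the real class in the codebase may differ."""
    u: float
    v: float
    probability: float
    true_positive: bool


def encode(keypoints: Sequence[ClassifiedKeypoint]) -> List[List[float]]:
    # One numeric row per keypoint, so the state can live in a tensor.
    return [[kp.u, kp.v, kp.probability, float(kp.true_positive)] for kp in keypoints]


def decode(rows: Sequence[Sequence[float]]) -> List[ClassifiedKeypoint]:
    # Rebuild the objects after the rows have been gathered.
    return [ClassifiedKeypoint(u, v, p, bool(tp)) for u, v, p, tp in rows]


keypoints = [
    ClassifiedKeypoint(10.0, 20.0, 0.9, True),
    ClassifiedKeypoint(3.0, 4.0, 0.2, False),
]
rows = encode(keypoints)
restored = decode(rows)
```

In a metric, rows like these could be converted with torch.tensor(rows) and appended to a list state, so the built-in synchronization only ever sees tensors.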

ExtReMLapin avatar May 15 '25 17:05 ExtReMLapin

TODO: DDP optimization, address the warning about unused parameters

ExtReMLapin avatar May 15 '25 17:05 ExtReMLapin

I have never tried parallel training with this codebase. Are you using the PyTorch Lightning Trainer for this?

I'm not sure how to add unit tests for parallel training; that is something we should look into!

tlpss avatar May 16 '25 15:05 tlpss

Well... I guess? I just used --device(s) 3

ExtReMLapin avatar May 16 '25 16:05 ExtReMLapin