open_clip

Support distributed evaluation

Open · mehdidc opened this issue 1 year ago · 7 comments

Currently, evaluation is done on rank zero only. This PR adds support for distributed evaluation (behind an optional --distributed-evaluation argument) to make evaluation faster; both zero-shot and retrieval evaluation are supported.
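
Concretely, each worker evaluates its own shard of the data and the per-rank counts are reduced at the end. A minimal sketch of that aggregation step (illustrative only, not the PR's exact code; `aggregate_accuracy` is a made-up helper):

```python
import torch
import torch.distributed as dist

def aggregate_accuracy(correct: int, total: int, device) -> float:
    # Sum the per-rank (correct, total) counts across all workers so
    # the final accuracy matches single-process evaluation.
    stats = torch.tensor([correct, total], dtype=torch.long, device=device)
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return stats[0].item() / stats[1].item()
```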

mehdidc · Sep 24 '22

This LGTM.

Could you please double-check, @mitchellnw or @rwightman?

rom1504 · Nov 07 '22

In the PyTorch ImageNet example, distributed eval uses an aux_val_loader to handle the case where the test set size is not divisible by num_gpus. Do we need that here? https://github.com/pytorch/examples/blob/main/imagenet/main.py#L396-L402

mitchellnw · Nov 07 '22

> In the PyTorch ImageNet example, distributed eval uses an aux_val_loader to handle the case where the test set size is not divisible by num_gpus. Do we need that here? https://github.com/pytorch/examples/blob/main/imagenet/main.py#L396-L402

Thanks for the link. I'm not sure why they use drop_last=True on val_loader (we don't here); probably to avoid having one GPU worker with far fewer examples than the others? So what they seem to do is drop_last during sharding, then compute validation on the last few examples in all GPU workers.
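
For reference, the pattern in the linked pytorch/examples code looks roughly like this (a sketch, not verbatim; `val_dataset`, `batch_size`, `workers`, and `world_size` are placeholders):

```python
from torch.utils.data import DataLoader, Subset
from torch.utils.data.distributed import DistributedSampler

# Each rank evaluates a disjoint, equal-sized shard; drop_last=True on
# the sampler truncates rather than padding with duplicated examples.
sampler = DistributedSampler(val_dataset, shuffle=False, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size,
                        sampler=sampler, num_workers=workers)

# The tail dropped by sharding (at most world_size - 1 examples) is
# then evaluated redundantly on every rank via aux_val_loader.
num_sharded = len(sampler) * world_size
if num_sharded < len(val_dataset):
    aux_val_dataset = Subset(val_dataset, range(num_sharded, len(val_dataset)))
    aux_val_loader = DataLoader(aux_val_dataset, batch_size=batch_size,
                                shuffle=False, num_workers=workers)
```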

mehdidc · Nov 14 '22

@mehdidc I think this is actually necessary, or else you can get different val perf when different numbers of GPUs are used; e.g., see this comment: https://github.com/facebookresearch/deit/blob/main/main.py#L221-L223

mitchellnw · Nov 14 '22

> @mehdidc I think this is actually necessary, or else you can get different val perf when different numbers of GPUs are used; e.g., see this comment: https://github.com/facebookresearch/deit/blob/main/main.py#L221-L223

I see, thanks @mitchellnw! OK, so I need to fix this. I really thought DistributedSampler with drop_last=False would do the "right" thing, in the sense of seeing each example exactly once even when the dataset size is not divisible by the number of workers (e.g. by distributing the remaining examples to a subset of workers), although that wouldn't be ideal for the distributed setting anyway.
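
A quick self-contained check of what DistributedSampler actually does with drop_last=False, on a toy 10-example dataset over 4 ranks: it pads the index list with repeats so every rank gets the same count, which is exactly why metrics can shift with world size.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))  # 10 examples, 4 "ranks"
indices = []
for rank in range(4):
    s = DistributedSampler(dataset, num_replicas=4, rank=rank,
                           shuffle=False, drop_last=False)
    indices += list(s)
# 12 indices in total: examples 0 and 1 are seen twice, because the
# sampler pads the index list so that all ranks are equal-sized.
print(sorted(indices))  # [0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```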

mehdidc · Nov 15 '22

Is this argument "--distributed_evaluation" not available in the current version?

dmlpt · Feb 15 '23

@dmlpt Not yet; I still need to fix the val dataloader as @mitchellnw mentioned, and rebase on master.

mehdidc · Feb 19 '23