open_clip

Support distributed evaluation

Open · mehdidc opened this issue 1 year ago · 7 comments

Currently, evaluation is done on rank zero only. This PR adds support for distributed evaluation (behind an optional --distributed-evaluation argument) to make evaluation faster; both zero-shot and retrieval evaluation are supported.
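
Concretely, each worker evaluates its own shard of the data and the per-rank counts are reduced at the end. A minimal sketch of that aggregation step (illustrative only, not the PR's exact code; `aggregate_accuracy` is a made-up helper):

```python
import torch
import torch.distributed as dist

def aggregate_accuracy(correct: int, total: int, device) -> float:
    # Sum the per-rank (correct, total) counts across all workers so
    # the final accuracy matches single-process evaluation.
    stats = torch.tensor([correct, total], dtype=torch.long, device=device)
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)
    return stats[0].item() / stats[1].item()
```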

mehdidc · Sep 24 '22

This LGTM.

Could you please double-check, @mitchellnw or @rwightman?

rom1504 · Nov 07 '22

In the PyTorch ImageNet example, distributed eval uses an aux_val_loader to handle the case where the test set size is not divisible by num_gpus. Do we need that here? https://github.com/pytorch/examples/blob/main/imagenet/main.py#L396-L402

mitchellnw · Nov 07 '22

> In the PyTorch ImageNet example, distributed eval uses an aux_val_loader to handle the case where the test set size is not divisible by num_gpus. Do we need that here? https://github.com/pytorch/examples/blob/main/imagenet/main.py#L396-L402

Thanks for the link. I'm not sure why they use drop_last=True on val_loader (we don't here); probably to avoid having one GPU worker with far fewer examples than the others? So what they seem to do is drop_last during sharding, then compute validation on the last few examples in all GPU workers.
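
For reference, the pattern in the linked pytorch/examples code looks roughly like this (a sketch, not verbatim; `val_dataset`, `batch_size`, `workers`, and `world_size` are placeholders):

```python
from torch.utils.data import DataLoader, Subset
from torch.utils.data.distributed import DistributedSampler

# Each rank evaluates a disjoint, equal-sized shard; drop_last=True on
# the sampler truncates rather than padding with duplicated examples.
sampler = DistributedSampler(val_dataset, shuffle=False, drop_last=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size,
                        sampler=sampler, num_workers=workers)

# The tail dropped by sharding (at most world_size - 1 examples) is
# then evaluated redundantly on every rank via aux_val_loader.
num_sharded = len(sampler) * world_size
if num_sharded < len(val_dataset):
    aux_val_dataset = Subset(val_dataset, range(num_sharded, len(val_dataset)))
    aux_val_loader = DataLoader(aux_val_dataset, batch_size=batch_size,
                                shuffle=False, num_workers=workers)
```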

mehdidc · Nov 14 '22

@mehdidc I think this is actually necessary, or else you can get different val perf when different numbers of GPUs are used; e.g., see this comment: https://github.com/facebookresearch/deit/blob/main/main.py#L221-L223

mitchellnw · Nov 14 '22

> @mehdidc I think this is actually necessary, or else you can get different val perf when different numbers of GPUs are used; e.g., see this comment: https://github.com/facebookresearch/deit/blob/main/main.py#L221-L223

I see, thanks @mitchellnw! OK, so I need to fix this. I really thought DistributedSampler with drop_last=False would do the "right" thing, in the sense of seeing each example exactly once even when the dataset size is not divisible by the number of workers (e.g. by distributing the remaining examples to a subset of workers), although that wouldn't be ideal for the distributed setting anyway.
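
A quick self-contained check of what DistributedSampler actually does with drop_last=False, on a toy 10-example dataset over 4 ranks: it pads the index list with repeats so every rank gets the same count, which is exactly why metrics can shift with world size.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))  # 10 examples, 4 "ranks"
indices = []
for rank in range(4):
    s = DistributedSampler(dataset, num_replicas=4, rank=rank,
                           shuffle=False, drop_last=False)
    indices += list(s)
# 12 indices in total: examples 0 and 1 are seen twice, because the
# sampler pads the index list so that all ranks are equal-sized.
print(sorted(indices))  # [0, 0, 1, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```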

mehdidc · Nov 15 '22

Is this argument "--distributed_evaluation" not available in the current version?

dmlpt · Feb 15 '23

@dmlpt Not yet; I still need to fix the val dataloader as @mitchellnw mentioned, and rebase on master.

mehdidc · Feb 19 '23