open_clip
Support distributed evaluation
Currently, evaluation is done on rank zero only. This PR adds support for distributed evaluation (via an optional --distributed-evaluation argument) to make evaluation faster; both zero-shot and retrieval evaluation are supported.
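For context, a minimal sketch of the general pattern (not this PR's actual code; the helper name `distributed_accuracy` is hypothetical): each rank evaluates its own shard of the validation set, then the per-rank counts are summed across ranks so every worker ends up with the same global metric.

```python
import torch
import torch.distributed as dist

def distributed_accuracy(model, loader, device):
    # Per-rank counts, kept on the GPU so they can be all-reduced.
    correct = torch.zeros(1, device=device)
    total = torch.zeros(1, device=device)
    model.eval()
    with torch.no_grad():
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            preds = model(images).argmax(dim=-1)
            correct += (preds == targets).sum()
            total += targets.numel()
    # Sum the shard-level counts across all ranks (default op is SUM).
    dist.all_reduce(correct)
    dist.all_reduce(total)
    return (correct / total).item()
```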
This LGTM. Could you please double-check, @mitchellnw or @rwightman?
In the PyTorch ImageNet example for distributed ImageNet eval, they have an aux_val_loader to handle the case where the test set size is not divisible by num_gpus. Do we need to have this? https://github.com/pytorch/examples/blob/main/imagenet/main.py#L396-L402
Thanks for the link. I'm not sure why they set drop_last=True on the val_loader sampler (not used here); probably to avoid one GPU worker having far fewer examples than the others? So rather, they seem to use drop_last and then compute validation performance for the last few examples on all GPU workers; a rough sketch of that pattern is below.
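Roughly what the linked example does (a sketch assuming an initialized process group; `build_val_loaders` is a hypothetical helper, not code from this PR):

```python
from torch.utils.data import DataLoader, DistributedSampler, Subset

def build_val_loaders(val_dataset, world_size, batch_size, num_workers):
    # Shard the bulk of the val set evenly; drop_last=True drops the
    # remainder so every rank sees the same number of examples.
    sampler = DistributedSampler(val_dataset, shuffle=False, drop_last=True)
    val_loader = DataLoader(val_dataset, batch_size=batch_size,
                            sampler=sampler, num_workers=num_workers)
    aux_val_loader = None
    covered = len(sampler) * world_size  # examples covered by the sharded loader
    if covered < len(val_dataset):
        # The dropped tail (fewer than world_size examples) is evaluated
        # on every rank, and its contribution is counted once when the
        # metrics are merged.
        tail = Subset(val_dataset, range(covered, len(val_dataset)))
        aux_val_loader = DataLoader(tail, batch_size=batch_size,
                                    shuffle=False, num_workers=num_workers)
    return val_loader, aux_val_loader
```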
@mehdidc I think this is actually necessary or else you can get different val perf when different numbers of gpus are used, e.g., see this comment: https://github.com/facebookresearch/deit/blob/main/main.py#L221-L223
I see, thanks @mitchellnw! OK, so I need to fix this. I really thought that DistributedSampler with drop_last=False would do the "right" thing, in the sense of seeing each example exactly once even when the dataset size is not divisible by the number of workers (e.g., the remaining examples could be distributed to a subset of workers), although that wouldn't be ideal for the distributed setting anyway.
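For reference, DistributedSampler with drop_last=False instead pads the index list by repeating samples until the total divides evenly across ranks, so some examples are counted twice. A quick standalone check (no process group needed, since num_replicas and rank are passed explicitly):

```python
from torch.utils.data import DistributedSampler

dataset = list(range(10))  # 10 examples, 4 "ranks": 10 % 4 != 0

for rank in range(4):
    sampler = DistributedSampler(dataset, num_replicas=4, rank=rank,
                                 shuffle=False, drop_last=False)
    print(rank, list(sampler))
# rank 0: [0, 4, 8]
# rank 1: [1, 5, 9]
# rank 2: [2, 6, 0]   <- index 0 seen twice
# rank 3: [3, 7, 1]   <- index 1 seen twice
```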
Is this --distributed-evaluation argument not available in the current version?
@dmlpt Not yet, I still need to fix the val dataloader as @mitchellnw mentioned and rebase on master.