
Add knob (IGNITE_DISABLE_DISTRIBUTED_METRICS=1) to disable distributed metrics reduction

Open · iXce opened this issue 2 years ago · 5 comments

This is useful for setups where distributed training is used, but evaluation is only performed on a single node (or independently over multiple nodes).

Description: In some setups one may want to have two different fleets for distributed training and (distributed) validation; however, ignite currently conflates the two world sizes. Using this PR as an RFC before adding tests and extra documentation (if needed).

Check list:

  • [ ] New tests are added (if a new feature is added)
  • [ ] New doc strings: description and/or example code are in RST format
  • [ ] Documentation is updated (if required)

iXce avatar Mar 16 '23 19:03 iXce
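
For readers skimming the thread, a rough sketch of what the proposed knob could do; `maybe_all_reduce` is a hypothetical helper name, not part of ignite or of this PR:

```python
import os

import torch
import ignite.distributed as idist


def maybe_all_reduce(tensor: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: skip the cross-rank reduction when the proposed
    knob is set, otherwise fall back to ignite's usual all_reduce."""
    if os.environ.get("IGNITE_DISABLE_DISTRIBUTED_METRICS") == "1":
        return tensor  # keep the per-rank (local) value
    return idist.all_reduce(tensor)
```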

Thanks @iXce for the suggestion. We could also add a boolean attribute called compute_per_rank to Metric, which all metrics descend from, and expose it through the constructor. But could you please explain why you need to evaluate each rank separately?

sadra-barikbin avatar Mar 16 '23 21:03 sadra-barikbin
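
A minimal sketch of what that suggestion could look like; `compute_per_rank` is the name from the comment above, while `SumMetric` and the rest are illustrative, not existing ignite API:

```python
import torch
import ignite.distributed as idist
from ignite.metrics import Metric
from ignite.metrics.metric import reinit__is_reduced


class SumMetric(Metric):
    """Illustrative only: a toy metric with a `compute_per_rank` constructor flag."""

    def __init__(self, compute_per_rank: bool = False, **kwargs):
        self.compute_per_rank = compute_per_rank
        super().__init__(**kwargs)

    @reinit__is_reduced
    def reset(self) -> None:
        self._sum = torch.tensor(0.0, device=self._device)

    @reinit__is_reduced
    def update(self, output) -> None:
        # Assumes `output` is a tensor of per-sample values.
        self._sum += output.sum().to(self._device)

    def compute(self) -> float:
        if self.compute_per_rank:
            return self._sum.item()  # local, per-rank value only
        # Default behaviour: reduce across all ranks (no-op if not distributed).
        return idist.all_reduce(self._sum.clone()).item()
```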

The most typical use case would be running distributed training but only running validation on the chief worker (because the validation set is small, or because validation uses a different procedure from training that doesn't necessarily work under distributed training).

I was initially a bit puzzled to see that the ignite metrics automatically react to distributed training being used (practically speaking, we were seeing them hang while waiting on allreduce(), because none of the other workers would participate).

Regarding where to place the knob, it feels like a tough question: one might want to be able to switch from a single-worker setup to distributed validation and metrics without having to reconfigure the metrics, I think? So a constructor parameter on Metric may be somewhat cumbersome?

iXce avatar Mar 16 '23 22:03 iXce
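
To make the hang concrete, a minimal sketch of the scenario (toy model and loader as stand-ins, launched with more than one process but with only rank 0 running the evaluator):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

import ignite.distributed as idist
from ignite.engine import create_supervised_evaluator
from ignite.metrics import Accuracy

# Toy stand-ins so the snippet is self-contained.
model = nn.Linear(4, 2)
val_loader = DataLoader(
    TensorDataset(torch.randn(8, 4), torch.randint(0, 2, (8,))), batch_size=4
)

# Only the chief worker runs validation...
if idist.get_rank() == 0:
    evaluator = create_supervised_evaluator(model, metrics={"acc": Accuracy()})
    # ...but Accuracy.compute() all-reduces its counters when world_size > 1,
    # and the other ranks never reach that collective call, so rank 0 blocks.
    evaluator.run(val_loader)
```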

@iXce thanks for the RFC! We also wanted to be able to override the metric decorators responsible for reducing data: https://github.com/pytorch/ignite/issues/1288. So you could override it to be a no-op for your use case, or provide a DDP subgroup and reduce over that subgroup. What do you think?

vfdev-5 avatar Mar 16 '23 23:03 vfdev-5
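
For reference, a rough sketch of what "reduce over a subgroup" could mean in plain torch.distributed terms (the rank list and tensor are stand-ins; this is not the API discussed in #1288):

```python
import torch
import torch.distributed as dist

# Assumes torch.distributed is already initialized.
eval_ranks = [0, 1]  # hypothetical: only these ranks run validation
# new_group() is collective: every rank must call it, even those not in it.
eval_group = dist.new_group(ranks=eval_ranks)

if dist.get_rank() in eval_ranks:
    local_value = torch.tensor(0.0)  # e.g. a metric's accumulated state
    # Reduce only among the validation ranks instead of the full world.
    dist.all_reduce(local_value, group=eval_group)
```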

Oh yeah that looks super relevant indeed!

iXce avatar Mar 17 '23 00:03 iXce

@iXce would you like to implement this feature as suggested in #1288? I'll update the issue with an API proposal and we can iterate on it. If you'd like, you can join our Discord server so we can discuss this more easily. Let us know what you think. Thanks!

vfdev-5 avatar Mar 21 '23 20:03 vfdev-5