fairseq2
Add HF metric wrapper
What does this PR do? Please describe:
This PR adds a wrapper for HuggingFace's evaluate.Metric to make it compatible with the fairseq2.metrics APIs. This enables evaluating fairseq2 models on the many downstream tasks available in HuggingFace, using standard evaluation metrics such as COMET, BLEU(RT), BERTScore, etc.
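To make the idea concrete, here is a minimal sketch of such a wrapper, assuming a torcheval-style Metric base class. The class name HFMetric and its constructor arguments are illustrative, not necessarily what this PR implements:

```python
# Minimal sketch only -- the actual wrapper in this PR may differ.
from typing import Any, Iterable

import evaluate  # HuggingFace evaluate
from torcheval.metrics import Metric  # fairseq2 metrics build on torcheval


class HFMetric(Metric[Any]):  # hypothetical name, for illustration
    """Exposes a HuggingFace ``evaluate`` metric through torcheval's Metric API."""

    def __init__(self, name: str, **load_kwargs: Any) -> None:
        super().__init__()
        # ``evaluate.load`` accepts ``num_process``/``process_id`` for
        # distributed runs, which is what makes the metric self-synchronizing.
        self._metric = evaluate.load(name, **load_kwargs)

    def update(self, *, predictions: Any, references: Any) -> "HFMetric":
        # Incremental updates are buffered by ``evaluate`` (temporary PyArrow tables).
        self._metric.add_batch(predictions=predictions, references=references)
        return self

    def compute(self) -> Any:
        # In a distributed run, ``evaluate`` gathers the buffered batches and
        # returns the final score on rank 0 only.
        return self._metric.compute()

    def merge_state(self, metrics: Iterable["HFMetric"]) -> "HFMetric":
        # Cross-rank merging is handled by ``evaluate`` itself, so there is
        # nothing to merge at the torcheval level.
        return self
```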
Small changes in fairseq2.metrics:
evaluate.Metric has built-in support for multi-node evaluation: the incremental metric-update code runs on separate nodes (writing results to temporary PyArrow tables), and the final compute() is then done on rank 0 only, pulling the results from those tables. To adapt this to fairseq2.metrics.bag.sync_and_compute_metrics(), we introduce an auto_sync attribute in the MetricBag, which is turned on when the bag contains HF metrics. In other words, we leave the synchronization to the underlying metrics rather than doing it in the bag itself.
One caveat is that we cannot mix HF and non-HF metrics within one bag; they have to be put in separate bags.
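Schematically, the auto_sync branch could look like the sketch below. This is not fairseq2's actual code; bag.metrics, bag.auto_sync, and the process_group argument are placeholders for the real fairseq2 objects (gang handling and error checking are omitted):

```python
# Illustrative only; the real fairseq2 helper differs in detail.
from torcheval.metrics.toolkit import sync_and_compute_collection


def sync_and_compute_metrics(bag, process_group=None):
    if getattr(bag, "auto_sync", False):
        # HF metrics synchronize themselves through their temporary PyArrow
        # tables, so each rank just calls ``compute``; only rank 0 receives
        # the final scores back from ``evaluate``.
        return {name: metric.compute() for name, metric in bag.metrics.items()}

    # Plain torcheval metrics: gather states across ranks and compute on the
    # coordinator rank, as before.
    return sync_and_compute_collection(bag.metrics, process_group=process_group)
```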
Fixes #{issue number}
Does your PR introduce any breaking changes? If yes, please list them: List of all backwards-incompatible changes.
Check list:
- [ ] Was the content of this PR discussed and approved via a GitHub issue? (no need for typos or documentation improvements)
- [x] Did you read the contributor guideline?
- [x] Did you make sure that your PR does only one thing instead of bundling different changes together?
- [x] Did you make sure to update the documentation with your changes? (if necessary)
- [x] Did you write any new necessary tests?
- [x] Did you verify new and existing tests pass locally with your changes?
- [ ] Did you update the CHANGELOG? (no need for typos, documentation, or minor internal changes)
I think we might need to reconsider the generic metrics API in fairseq2 before merging this PR. Right now it pretty much uses torcheval, which apparently takes a different approach to syncing metrics.

I have a fundamental question though. In your use case, how do you combine fairseq2 metrics and HF evaluate? It looks like HF has similar APIs, such as combine, that act like MetricBag. Which approach makes more sense in your opinion: (1) using the two APIs in parallel in your code, or (2) abstracting HF away under fairseq2.metrics (i.e. this PR)? If the latter, we might consider defining a Metric interface in fairseq2 (instead of using torcheval's Metric interface) and have derived interfaces for torcheval and HF that fit the corresponding libraries better.
> I think we might need to reconsider the generic metrics API in fairseq2 before merging this PR. Right now it pretty much uses torcheval, which apparently takes a different approach to syncing metrics.
I think the decision to bind fairseq2 to torcheval is a good one. It is a thin, low-level library that helps build further libraries on top. HF, on the other hand, targets ready-to-use use cases, and it makes sense for it to hide the synchronization logic inside.
> I have a fundamental question though. In your use case, how do you combine fairseq2 metrics and HF evaluate? It looks like HF has similar APIs, such as combine, that act like MetricBag. Which approach makes more sense in your opinion: (1) using the two APIs in parallel in your code, or (2) abstracting HF away under fairseq2.metrics (i.e. this PR)? If the latter, we might consider defining a Metric interface in fairseq2 (instead of using torcheval's Metric interface) and have derived interfaces for torcheval and HF that fit the corresponding libraries better.
Good questions. Until now I have always followed (1), keeping a separate metrics API in evaluation tasks. That served my use cases (pure evaluation) well, but now I am thinking of using the evaluation within the training loop. Having the same MetricBag API helps me "register" the states and continue later.
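For reference, the evaluate.combine API mentioned above groups several HF metrics behind a single object with a shared add_batch/compute interface, which is what makes option (1) workable in pure-evaluation code. The metric choice here (sacrebleu plus chrf) is just an example:

```python
import evaluate

# Combine two translation metrics into one object, roughly analogous
# to a MetricBag on the HF side.
combined = evaluate.combine(["sacrebleu", "chrf"])

combined.add_batch(
    predictions=["the cat sat on the mat"],
    references=[["the cat sat on the mat"]],
)

# Returns one dict with the results of both metrics (duplicate keys such
# as "score" are disambiguated with the metric name as a prefix).
print(combined.compute())
```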
@cbalioglu Maybe we can indeed wait a bit for the use case to be clearer and gather more requirements before merging this PR.