evaluate
🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Currently, when you load a metric that requires the same library twice, e.g., [chrF](https://huggingface.co/spaces/evaluate-metric/chrf/blob/main/chrf.py#L16-L18), the error message for the missing library mentions that library multiple times. For instance: >...
This works:

```python
metric = evaluate.load('f1')
metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], average=None)
```

This won't work:

```python
metric = evaluate.combine(["f1"])
metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1,...
```
This widget seems like it'd be useful for demonstration purposes, but right now I'm unclear whether it's broken or incomplete. I assume the rows in the columns data (measurement) and...
This PR:
* Refactors the docstrings to avoid duplicates in the `Evaluator` subclasses
* Puts all arguments in the `compute()` signature for subclasses of `Evaluator`, so as to be able...
Hi, thanks for your work on this project! I was surprised to see that your perplexity implementation uses the base-two exponential. See https://github.com/huggingface/evaluate/blob/main/metrics/perplexity/perplexity.py#L183 Is this intended or a bug?
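For context, a small sketch of the two conventions (the probabilities here are illustrative, not taken from the library's code): perplexity is the exponential of the mean negative log-likelihood, and the base of the exponential must match the base of the logarithm, so `2 ** nll` is only correct when the log-likelihoods were taken base 2.

```python
import math

# Illustrative token probabilities (not from the evaluate implementation).
probs = [0.25, 0.5, 0.125, 0.25]

# Natural-log convention: PPL = exp(mean negative ln-likelihood).
ppl_e = math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Base-2 convention: PPL = 2 ** (mean negative log2-likelihood).
ppl_2 = 2 ** (-sum(math.log2(p) for p in probs) / len(probs))

# Both yield the same perplexity because each exponential matches its log.
# Mixing them (e.g. 2 ** natural-log NLL) would understate the value.
```

Since most deep-learning frameworks report losses with the natural log, the question is essentially whether the implementation's log-likelihoods are base 2 to match its `2 **` exponential.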
Currently the `perplexity` metric and measurement both instantiate an entire model object within the `_compute()` function and run inference, which breaks the pattern where only predictions, references, and other metadata...
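For contrast, a minimal sketch of the conventional `_compute()` shape, where the function receives only predictions and references and performs no model inference (the accuracy logic here is just a stand-in, not the library's actual implementation):

```python
# Stand-in metric following the usual pattern: _compute() takes only
# predictions and references; any model inference happens upstream.
def _compute(predictions, references):
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}
```

Under this pattern, the caller runs the model and passes in its outputs, so the metric itself stays cheap and side-effect free.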
`poetry add evaluate`

```
Using version ^0.2.2 for evaluate

Updating dependencies
Resolving dependencies... (86.1s)

Writing lock file

Package operations: 15 installs, 0 updates, 0 removals

• Installing frozenlist (1.3.1)
• ...
```
Caching results from the Evaluator requires checking uniqueness of results against a (model_or_pipeline, dataset, evaluation module) tuple. We can version datasets by accessing their `.fingerprint` attribute, and evaluation modules by...
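One way such a key could be derived (a sketch under the assumption that each component of the tuple can be reduced to a stable string; `cache_key` is a hypothetical helper, not part of the library):

```python
import hashlib

def cache_key(model_id: str, dataset_fingerprint: str, module_hash: str) -> str:
    # Join the identifying (model, dataset, evaluation module) tuple
    # and hash it into a fixed-length, deterministic cache key.
    payload = "::".join([model_id, dataset_fingerprint, module_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key is deterministic, two runs over the same tuple map to the same cache entry, while changing any component produces a different key.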
Hi, I find the API at https://huggingface.co/metrics quite useful. I am playing around with the video/image captioning task, where CIDEr is a popular metric. Do you plan to add this into...