helen ngo

Results: 16 issues by helen ngo

This is a proposed refactor of the `perplexity` metric that would bring it closer to the other metrics in `evaluate`, which generally do not run inference in their `compute` functions,...
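
For concreteness, here's a rough sketch of the direction, assuming callers pass precomputed per-token log-probabilities instead of a `model_id` (the function and argument names are hypothetical, not the current `evaluate` API):

```python
import numpy as np

def compute_perplexity(log_probs):
    """Hypothetical post-refactor `_compute`: takes per-token log-probabilities
    produced by the caller's own inference step, so the metric never loads a
    model or runs a forward pass itself."""
    # Perplexity per sequence = exp(mean negative log-likelihood).
    return {"perplexities": [float(np.exp(-np.mean(lp))) for lp in log_probs]}

# Log-probabilities for two short sequences, computed elsewhere
# (e.g. with a transformers model and the caller's own batching).
print(compute_perplexity([[-2.3, -1.1, -0.7], [-3.0, -0.5]]))
```
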

Merging with the open docs PR for perplexity, #238. Closes #241.

Previously, `evaluator.compute(..., data='imdb', ...)` would fail because it returned an object of type `datasets.DatasetDict`. This automatically detects a split if none is given (i.e. the user passes in the dataset...
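
A minimal sketch of the kind of split detection this involves, assuming the string is loaded with `datasets.load_dataset` and a preferred split is chosen when a `DatasetDict` comes back (the helper name and preference order are assumptions):

```python
from datasets import DatasetDict, load_dataset

def load_with_split(data, split=None, preferred=("test", "validation", "train")):
    """Load `data` by name and return a single Dataset, choosing a split
    automatically when none is given and the load returns a DatasetDict."""
    ds = load_dataset(data, split=split)
    if isinstance(ds, DatasetDict):
        for name in preferred:
            if name in ds:
                return ds[name]
        # Fall back to the first available split.
        return ds[next(iter(ds))]
    return ds
```

e.g. `load_with_split("imdb")` would return the `test` split rather than the full `DatasetDict`.
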

This widget seems like it'd be useful for demonstration purposes but right now I'm unclear if it's broken or incomplete. I assume the rows in the columns data (measurement) and...

Currently the `perplexity` metric and measurement both instantiate an entire model object within the `_compute()` function and run inference, which breaks the pattern where only predictions, references, and other metadata...
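
For contrast, here's a bare-bones module following that pattern, loosely based on the custom-module template: `_compute()` only sees predictions and references, with no model loading or inference inside.

```python
import datasets
import evaluate

class ExactMatch(evaluate.Metric):
    """Toy module following the usual pattern: inference happens outside,
    and `_compute()` only touches predictions and references."""

    def _info(self):
        return evaluate.MetricInfo(
            description="Toy exact-match metric.",
            citation="",
            inputs_description="predictions and references as strings",
            features=datasets.Features(
                {"predictions": datasets.Value("string"),
                 "references": datasets.Value("string")}
            ),
        )

    def _compute(self, predictions, references):
        matches = [p == r for p, r in zip(predictions, references)]
        return {"exact_match": sum(matches) / len(matches)}
```
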

Caching results from the Evaluator requires checking uniqueness of results against a (model_or_pipeline, dataset, evaluation module) tuple. We can version datasets by accessing their `.fingerprint` attribute, and evaluation modules by...
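
A rough sketch of how such a key could be assembled, assuming the dataset exposes a fingerprint and the pipeline and module can be reduced to stable identifiers (the attribute choices here are assumptions, not settled API):

```python
import hashlib
import json

def evaluator_cache_key(model_or_pipeline, dataset, module):
    """Reduce a (model_or_pipeline, dataset, evaluation module) triple to a
    stable hash. The attributes used below are illustrative assumptions."""
    parts = {
        # A Hub model id, or whatever identifies the pipeline's underlying model.
        "model": getattr(model_or_pipeline, "name_or_path", None) or str(model_or_pipeline),
        # `datasets` tracks a content-based fingerprint for Dataset objects.
        "dataset": getattr(dataset, "fingerprint", None) or getattr(dataset, "_fingerprint", None),
        # For evaluation modules, the module name (plus, ideally, a revision).
        "module": getattr(module, "name", None) or str(module),
    }
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()
```
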

In addition to the current task types available in the Evaluator, we want a generic text generation pipeline which runs inference and returns generations. The "data" the evaluator will take...
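
One possible shape for this, sketched with a `transformers` text-generation pipeline and a measurement over the generations; it's meant to illustrate the inference-then-measure flow, not the existing `Evaluator` API:

```python
import evaluate
from transformers import pipeline

def run_text_generation_eval(model_id, prompts, measurement="word_length"):
    """Generate continuations for a list of prompts, then score the
    generations with a measurement module. Names and defaults here
    are illustrative."""
    generator = pipeline("text-generation", model=model_id)
    outputs = generator(prompts, max_new_tokens=50)
    generations = [out[0]["generated_text"] for out in outputs]
    scorer = evaluate.load(measurement, module_type="measurement")
    return scorer.compute(data=generations)
```
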

Here's a rough proposal for an evaluation harness interface, where users pass in a JSON file which configures the evaluator and a set of "tasks", each made up of a dataset, metric, and other...
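
To make that shape concrete, here's one possible config written out from Python; every field name below is hypothetical and only meant to illustrate the dataset/metric/task structure:

```python
import json

# Hypothetical harness config: one block configuring the evaluator itself,
# plus a list of "tasks" that each pair a dataset (with its split/columns)
# with a metric.
config = {
    "evaluator": {
        "task": "text-classification",
        "model_or_pipeline": "some-org/some-finetuned-model",
    },
    "tasks": [
        {"dataset": "imdb", "split": "test", "metric": "accuracy",
         "input_column": "text", "label_column": "label"},
        {"dataset": "sst2", "split": "validation", "metric": "f1",
         "input_column": "sentence", "label_column": "label"},
    ],
}

with open("harness_config.json", "w") as f:
    json.dump(config, f, indent=2)
```
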

Closes #296, which hopefully results in fewer broken Spaces. Nothing fancy about this implementation; it's pretty specific to Hub metric card formats, but it works just fine for what we...