helen ngo

Results: 16 issues by helen ngo

This is a proposed refactor of the `perplexity` metric that would bring it closer to the other metrics in `evaluate`, which generally do not run inference in their `compute` functions,...
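
For concreteness, here's a rough sketch of the direction, assuming callers pass precomputed per-token log-probabilities instead of a `model_id` (the function and argument names are hypothetical, not the current `evaluate` API):

```python
import numpy as np

def compute_perplexity(log_probs):
    """Hypothetical post-refactor `_compute`: takes per-token log-probabilities
    produced by the caller's own inference step, so the metric never loads a
    model or runs a forward pass itself."""
    # Perplexity per sequence = exp(mean negative log-likelihood).
    return {"perplexities": [float(np.exp(-np.mean(lp))) for lp in log_probs]}

# Log-probabilities for two short sequences, computed elsewhere
# (e.g. with a transformers model and the caller's own batching).
print(compute_perplexity([[-2.3, -1.1, -0.7], [-3.0, -0.5]]))
```
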

Merging with the open docs PR for perplexity, #238. Closes #241.

Previously, `evaluator.compute(..., data='imdb', ...)` would fail because it returned an object of type `datasets.DatasetDict`. This automatically detects a split if none is given (i.e. the user passes in the dataset...
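
A minimal sketch of the kind of split detection this involves, assuming the string is loaded with `datasets.load_dataset` and a preferred split is chosen when a `DatasetDict` comes back (the helper name and preference order are assumptions):

```python
from datasets import DatasetDict, load_dataset

def load_with_split(data, split=None, preferred=("test", "validation", "train")):
    """Load `data` by name and return a single Dataset, choosing a split
    automatically when none is given and the load returns a DatasetDict."""
    ds = load_dataset(data, split=split)
    if isinstance(ds, DatasetDict):
        for name in preferred:
            if name in ds:
                return ds[name]
        # Fall back to the first available split.
        return ds[next(iter(ds))]
    return ds
```

e.g. `load_with_split("imdb")` would return the `test` split rather than the full `DatasetDict`.
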

This widget seems like it'd be useful for demonstration purposes but right now I'm unclear if it's broken or incomplete. I assume the rows in the columns data (measurement) and...

Currently the `perplexity` metric and measurement both instantiate an entire model object within the `_compute()` function and run inference, which breaks the pattern where only predictions, references, and other metadata...
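
For contrast, here's a bare-bones module following that pattern, loosely based on the custom-module template: `_compute()` only sees predictions and references, with no model loading or inference inside.

```python
import datasets
import evaluate

class ExactMatch(evaluate.Metric):
    """Toy module following the usual pattern: inference happens outside,
    and `_compute()` only touches predictions and references."""

    def _info(self):
        return evaluate.MetricInfo(
            description="Toy exact-match metric.",
            citation="",
            inputs_description="predictions and references as strings",
            features=datasets.Features(
                {"predictions": datasets.Value("string"),
                 "references": datasets.Value("string")}
            ),
        )

    def _compute(self, predictions, references):
        matches = [p == r for p, r in zip(predictions, references)]
        return {"exact_match": sum(matches) / len(matches)}
```
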

Caching results from the Evaluator requires checking uniqueness of results against a (model_or_pipeline, dataset, evaluation module) tuple. We can version datasets by accessing their `.fingerprint` attribute, and evaluation modules by...
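
A rough sketch of how such a key could be assembled, assuming the dataset exposes a fingerprint and the pipeline and module can be reduced to stable identifiers (the attribute choices here are assumptions, not settled API):

```python
import hashlib
import json

def evaluator_cache_key(model_or_pipeline, dataset, module):
    """Reduce a (model_or_pipeline, dataset, evaluation module) triple to a
    stable hash. The attributes used below are illustrative assumptions."""
    parts = {
        # A Hub model id, or whatever identifies the pipeline's underlying model.
        "model": getattr(model_or_pipeline, "name_or_path", None) or str(model_or_pipeline),
        # `datasets` tracks a content-based fingerprint for Dataset objects.
        "dataset": getattr(dataset, "fingerprint", None) or getattr(dataset, "_fingerprint", None),
        # For evaluation modules, the module name (plus, ideally, a revision).
        "module": getattr(module, "name", None) or str(module),
    }
    return hashlib.sha256(json.dumps(parts, sort_keys=True).encode()).hexdigest()
```
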

In addition to the current task types available in the Evaluator, we want a generic text generation pipeline which runs inference and returns generations. The "data" the evaluator will take...
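
One possible shape for this, sketched with a `transformers` text-generation pipeline and a measurement over the generations; it's meant to illustrate the inference-then-measure flow, not the existing `Evaluator` API:

```python
import evaluate
from transformers import pipeline

def run_text_generation_eval(model_id, prompts, measurement="word_length"):
    """Generate continuations for a list of prompts, then score the
    generations with a measurement module. Names and defaults here
    are illustrative."""
    generator = pipeline("text-generation", model=model_id)
    outputs = generator(prompts, max_new_tokens=50)
    generations = [out[0]["generated_text"] for out in outputs]
    scorer = evaluate.load(measurement, module_type="measurement")
    return scorer.compute(data=generations)
```
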

Here's a rough proposal for an evaluation harness interface, where users pass in a JSON file which configures the evaluator and a set of "tasks", each made up of a dataset, metric, and other...
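
To make that shape concrete, here's one possible config written out from Python; every field name below is hypothetical and only meant to illustrate the dataset/metric/task structure:

```python
import json

# Hypothetical harness config: one block configuring the evaluator itself,
# plus a list of "tasks" that each pair a dataset (with its split/columns)
# with a metric.
config = {
    "evaluator": {
        "task": "text-classification",
        "model_or_pipeline": "some-org/some-finetuned-model",
    },
    "tasks": [
        {"dataset": "imdb", "split": "test", "metric": "accuracy",
         "input_column": "text", "label_column": "label"},
        {"dataset": "sst2", "split": "validation", "metric": "f1",
         "input_column": "sentence", "label_column": "label"},
    ],
}

with open("harness_config.json", "w") as f:
    json.dump(config, f, indent=2)
```
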

Closes #296, which hopefully results in fewer broken Spaces. Nothing fancy about this implementation; it's pretty specific to Hub metric card formats, but it works just fine for what we...