helen ngo

Results: 27 comments by helen ngo

Thanks @lvwerra!
* re: `scores` — I found out the same thing when I started looking at implementing this for the QA evaluator; the input/output signatures for various evaluators can...

re: `scores`, I see your point that a classifier returning only predicted _labels_ without prediction scores could plausibly output the same predicted labels for all canary examples. Unfortunately it's not straightforward...
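
To make the collision concern concrete, here's a minimal sketch; the `fingerprint` helper and canary strings are hypothetical stand-ins for whatever the canary check would actually hash, not the real implementation:

```
import hashlib

# Hypothetical canary examples and fingerprinting helper (illustration only).
canary_examples = ["canary text 1", "canary text 2", "canary text 3"]

def fingerprint(predicted_labels):
    # With label-only output there are no scores to hash, just the labels.
    return hashlib.sha256(repr(predicted_labels).encode()).hexdigest()

# Two unrelated classifiers can easily agree on a handful of canary examples...
model_a_labels = [0, 0, 1]
model_b_labels = [0, 0, 1]

# ...and then their fingerprints collide, so the check can't tell them apart.
assert fingerprint(model_a_labels) == fingerprint(model_b_labels)
```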

Hi @lhoestq! @lvwerra suggested borrowing the custom Pickler from `datasets` as an alternative to this approach if we are worried about canary collisions, since it was mentioned that the `datasets`...
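
For context, a rough sketch of what leaning on the deterministic hashing in `datasets` could look like; `Hasher` sits on top of the custom Pickler mentioned above, and the example inputs here are made up:

```
from datasets.fingerprint import Hasher

# Made-up evaluator outputs, just to show the call.
predictions = {"labels": [0, 1, 1], "scores": [0.91, 0.67, 0.58]}

# Hasher serializes the object with datasets' custom Pickler and returns a
# deterministic hex digest, stable across runs and processes.
print(Hasher.hash(predictions))
```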

Closing for now since pipe objects can't be generically pickled (non-deterministic containers/ops/etc.) and we haven't figured out what to do about the possibility of collisions in data canaries.

Hi @ola13, I actually put the original call to the parent class back in and returned the Dataset object instead, since I noticed that both the QA and token classification...

Getting the strangest error in CI here: `TestQuestionAnsweringEvaluator.test_model_init` is failing in pytest with `AssertionError: 33.333333333333336 != 0`, but running the test suite in debug mode clearly shows the correct test...

Seconding that this seems like it'd be useful! I recently tried to load a custom metric which opened up a JSON file for configuration, but the JSON file wasn't loaded....
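
As a sketch of the setup I had in mind (file names and layout here are hypothetical): the metric script itself resolves and loads, but a config file sitting next to it isn't pulled in alongside it:

```
import evaluate

# Hypothetical local layout:
#   my_metric/
#     my_metric.py   <- custom metric script; reads config.json at runtime
#     config.json    <- configuration the script expects to find next to it
#
# The script itself loads, but config.json is not brought along with it, so
# the relative open() inside the metric can't find the file.
metric = evaluate.load("./my_metric/my_metric.py")
```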

Thanks @lvwerra 🌊
* Skipping the config sounds fine to me. I had modelled this after the Transformers `return AutoConfig.from_pretrained(*args, **kwargs)` logic but am also happy to init the Harness...
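
For concreteness, this is the Transformers pattern being referenced; the surrounding loader here is a hypothetical sketch, not the actual Harness code:

```
from transformers import AutoConfig

def load_config(*args, **kwargs):
    # Thin passthrough, mirroring the `return AutoConfig.from_pretrained(*args, **kwargs)` style.
    return AutoConfig.from_pretrained(*args, **kwargs)

config = load_config("bert-base-uncased")
```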

Merging with main resulted in a ton of extra commits from the merge commits, which has made the history on this branch really difficult to read/rebase, so I'll likely redo...

A comparison, for reference, on the sentence `['Hugging Face is a startup based in New York City and Paris']`.

Previously, base 2:
```
import evaluate
perplexity = evaluate.load("perplexity", module_type="metric")
input_texts...
```
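
For reference, a runnable sketch of the same call; the `predictions=` argument follows the current perplexity metric card, while older releases (as in the snippet above) took `input_texts=`, so treat the exact name as an assumption about the installed version:

```
import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ['Hugging Face is a startup based in New York City and Paris']

# `predictions=` follows the current metric card; older releases took `input_texts=`.
results = perplexity.compute(model_id="gpt2", predictions=input_texts)
print(results["perplexities"], results["mean_perplexity"])
```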