Cache results from `evaluator` and implement data canaries for reproducibility

Open mathemakitten opened this issue 3 years ago • 4 comments

Caching results from the Evaluator requires checking uniqueness of results against a (model_or_pipeline, dataset, evaluation module) tuple.

We can version datasets by accessing their .fingerprint attribute, and evaluation modules by their ._hash. However, model_or_pipeline can be any type of {str, Pipeline, Callable, PreTrainedModel, TFPreTrainedModel}, and until now there was no clear way to universally assert a unique fingerprint for all of these model types. It's not possible to simply serialize and hash the pipe object created by the Evaluator in compute.prepare_pipeline, because pipes can be built on top of non-serializable functions or use non-deterministic containers (e.g. dictionaries).
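
For concreteness, a minimal sketch of the kind of cache key this implies (build_cache_key and the canary_hash argument are hypothetical names, not the actual implementation in this PR):

import hashlib

def build_cache_key(canary_hash: str, dataset, module) -> str:
    # The dataset already carries a fingerprint and the evaluation module a hash;
    # the pipe is the only component without a built-in identifier, hence the canary hash.
    parts = (canary_hash, dataset._fingerprint, module._hash)
    return hashlib.md5("::".join(parts).encode("utf-8")).hexdigest()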

In order to "fingerprint" arbitrary model_or_pipeline objects we use a data canary ("canary" like "canary in the coal mine"), which is a small, known dataset (e.g. three sentences) fed into a newly-instantiated pipe in whatever format the pipe needs. Raw predictions for these canaries are serialized and hashed, and this "canary hash" is used to fingerprint the pipe. The key assumption here is that pipes are fully deterministic and will always produce the same prediction for some input (which is a fine assumption, because if the pipe produces non-deterministic results, you probably don't want to cache its results anyway).
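
Roughly, the idea looks like the following (a sketch, not the exact implementation in this PR; the canary sentences are the text-classification examples quoted further down the thread):

import hashlib

import dill

CANARY_TEXTS = [
    "blue whales are so big",
    "the ocean is so vast",
    "shrimp are the most untrustworthy ocean creature",
]

def compute_canary_hash(pipe) -> str:
    # Run the freshly-instantiated pipe on a tiny, fixed input set and hash the raw
    # predictions; if the pipe is deterministic, this hash identifies it.
    predictions = pipe(CANARY_TEXTS)
    return hashlib.md5(dill.dumps(predictions)).hexdigest()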

After experimenting with various failed ways of "fingerprinting" arbitrary model/pipeline objects I am fairly confident in saying this is a generalizable and hopefully reliable way of asserting that the model_or_pipeline we're caching against is the right one. Right now I've only written canary test cases for the "text-classification" task, but I'm in the midst of adding the other ones. Feedback welcome :)

Closes #126.

mathemakitten avatar Aug 03 '22 03:08 mathemakitten

Thanks @lvwerra!

  • re: scores — I found out the same thing when I started looking at implementing this for the QA evaluator; the input/output signatures for various evaluators can differ wildly, hence the "if task is text classification, get the scores like this, elif image classification, get scores like that" logic which is sort of messy. To address this I've serialized and hashed the predictions object instead, because that's the last consistent object that we're guaranteed to have. predictions produce consistent hashes as well, allowing for them to be used as the canary_key.

  • Also, note that right now cache-checking via canaries only works for well-defined evaluator tasks (image + text classification, question-answering and token classification to be implemented). I'm not sure if there's a straightforward way to automatically create canary data for a custom evaluator task; I think it'd have to be user-defined. Let me know if you see otherwise though!

  • I originally had the canary examples hardcoded but removed them because I wasn't sure if hardcoding data was the way we wanted to go. Your reasoning makes sense, so I've reverted it to be hardcoded for text. Do you think we should also do this for image canary examples? It would just mean that we end up version-controlling (small) image files within the repo, which is perhaps suboptimal. Thoughts?

  • Moved canary examples from tests to utils/canary!

  • Agreed! I think the combination of exact-match on predictions + hashing should accomplish this. There's also a warning when we're using cached results. However, I'm not sure if there's a way to know ahead of time if an operation is un-cacheable. I've added a warning for when canaries aren't implemented for that pipe type.

mathemakitten avatar Aug 03 '22 19:08 mathemakitten

re: scores — my concern is that a user might, for example, provide a custom pipeline (e.g. from a sklearn model) that only returns the labels (no scores). I feel the chances that two sentiment classifiers produce the same three labels on the following inputs are not that small:

"blue whales are so big",
"the ocean is so vast",
"shrimp are the most untrustworthy ocean creature",

Thus I would throw a warning if no scores are available from the classification pipeline and say the pipeline is not cacheable.
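
A minimal sketch of that check, assuming the usual text-classification output format of a list of dicts with "label" and "score" keys (the helper name and warning text are illustrative):

import warnings

def is_cacheable(predictions) -> bool:
    # A labels-only pipeline could plausibly collide on three canary sentences,
    # so refuse to cache if no scores are returned.
    if not all("score" in p for p in predictions):
        warnings.warn("Pipeline returns no scores; results will not be cached.")
        return False
    return True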

PS: since predictions are always something like a list of dicts, I think you could just use json to turn the predictions into a string and hash that string directly. That way you don't need to use dill or write to a file.

lvwerra avatar Aug 04 '22 17:08 lvwerra

re: scores, I see your point that a classifier returning only predicted labels without scores could plausibly output the same predicted labels for all canary examples. Unfortunately it's not straightforward to generically check for the presence of the "scores" key in the dictionary, since predictions can vary in format (see below for more detail). We can do this for the text classification pipeline; other evaluators may just need to tweak the compute_canary_hash function accordingly.

re: formatting of predictions as JSON — the outputs vary slightly across evaluators and aren't guaranteed to be JSON-serializable by default (e.g. the token classification and image classification evaluators output a list of lists of dicts, while the text classification evaluator outputs a list of dicts). Because of this, it is simpler to generically hash the entire predictions object where possible, for the reasons mentioned above, rather than wrangling each case into something JSON-compatible: serializing predictions with dill (or pickle) works out-of-the-box for all evaluators, whereas serializing to JSON takes extra steps. In code, the difference is hashlib.md5(json.dumps(predictions).encode('utf-8')).hexdigest(), with possible extra post-processing of predictions needed, vs. hashlib.md5(dill.dumps(predictions)).hexdigest(), with the latter being significantly less cluttered.
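
To make the comparison concrete (the example predictions below are made up; the shape follows the evaluators mentioned above):

import hashlib
import json

import dill

# e.g. token classification: a list of lists of dicts
predictions = [[{"entity": "B-PER", "score": 0.98, "word": "hugging"}]]

# dill handles arbitrary nesting and non-JSON-native types (e.g. numpy floats) as-is
dill_hash = hashlib.md5(dill.dumps(predictions)).hexdigest()

# json only works once everything has been coerced to JSON-serializable types
json_hash = hashlib.md5(json.dumps(predictions).encode("utf-8")).hexdigest()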

mathemakitten avatar Aug 08 '22 23:08 mathemakitten

Hi @lhoestq! @lvwerra suggested borrowing the custom Pickler from datasets as an alternative to this approach if we are worried about canary collisions, since it was mentioned that the datasets team had done some work to extend the pickler. However, trying to pickle pipe or pipe.model results in non-deterministic hashes, presumably due to non-sorted/non-deterministic containers used somewhere in the underlying model code. Can you confirm whether you'd expect the following snippet to produce deterministic hashes? I'm going to separately look into how difficult it would be to track down the non-determinism, but I expect it might be fairly tricky.

from evaluate import evaluator
from hashlib import md5
from datasets.utils.py_utils import dumps  # dill-based dumps used by datasets' fingerprinting
from datasets import Dataset, load_dataset
from datasets.fingerprint import Hasher

data_imdb = Dataset.from_dict(load_dataset("imdb")["test"][:2])
print(f"dataset fingerprint: {data_imdb._fingerprint}")

e = evaluator("text-classification")

MODEL_OR_PIPELINE = 'huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli'

results = e.compute(
    model_or_pipeline=MODEL_OR_PIPELINE,
    data=data_imdb,
    metric="accuracy",
    input_column="text",
    label_column="label",
    label_mapping={"LABEL_0": 0.0, "LABEL_1": 1.0},
)

pipe = e.prepare_pipeline(model_or_pipeline=MODEL_OR_PIPELINE)

h = Hasher()
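# Note: both of the hashes below come out non-deterministic for this pipe, which is the issue described above.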
print(f"hash for this pipe: {md5(dumps(pipe.model)).hexdigest()}")
print(f"other hash for this pipe: {h.hash(dumps(pipe.model))}")

mathemakitten avatar Aug 30 '22 16:08 mathemakitten

Hi! Oh indeed, and it's not really easy to investigate. If you can't find an easy solution with the Hasher, your idea with the canary is maybe easier to implement and maintain. Let me know if I can help regarding the Hasher btw (you can ping me on Slack anytime).

lhoestq avatar Aug 30 '22 18:08 lhoestq

Closing for now, since pipe objects can't be generically pickled (non-deterministic containers/ops/etc.) and we haven't figured out what to do about the possibility of collisions in data canaries.

mathemakitten avatar Sep 12 '22 22:09 mathemakitten