evaluate
Adding tokenizer_id to perplexity evaluation
The motivation behind this PR is to allow the tokenizer to be loaded from a different location than the model. This is useful when you are running in an air-gapped environment and the tokenizer lives in a different place.
As a note, I am new to the project, so feel free to point me in the right direction.
Thanks for the PR! I don't quite follow why the tokenizer id would be different from the model id in that case?
Thanks for taking a look @lvwerra.
When using the Hub, tokenizer_id can differ from model_id: they are two entities that can be optimised and iterated on separately. For example, I could be using https://huggingface.co/meta-llama/Llama-2-7b with a tokenizer that is not on the Hub.
In my particular scenario, I do not use the Hub but reference the model by path. I iterate on my model and my tokenizer separately and host each in its own bucket:
perplexity.compute(data=fine_tuned_model_responses, model_id=reference_model_path)['mean_perplexity']
Because of this, I had to copy my tokenizer over to the same reference_model_path, which I found inconvenient.
I thought it would be nice for people to also be able to do:
perplexity.compute(data=fine_tuned_model_responses, model_id=reference_model_path, tokenizer_id=reference_tokenizer_path)['mean_perplexity']
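As a rough sketch of how such a parameter could behave: the helper name resolve_tokenizer_source and the fall-back-to-model_id default are assumptions for illustration, not taken from the PR diff.

```python
from typing import Optional


def resolve_tokenizer_source(model_id: str, tokenizer_id: Optional[str] = None) -> str:
    """Hypothetical helper: pick the path or Hub id the tokenizer is loaded from."""
    # Assumption: when tokenizer_id is omitted, fall back to model_id so that
    # existing calls that pass only model_id keep working unchanged.
    return tokenizer_id if tokenizer_id is not None else model_id


# Inside the metric, the result could then be passed to the tokenizer loader,
# e.g. AutoTokenizer.from_pretrained(resolve_tokenizer_source(model_id, tokenizer_id)).
```

With a default like this, callers who keep the tokenizer next to the model would not need to change anything, while callers with separate buckets could pass both paths.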
Hopefully, this makes sense. If not, happy to learn! Thanks!