Adding tokenizer_id to perplexity evaluation

Open Albertoimpl opened this issue 2 years ago • 3 comments

The motivation behind this PR is to allow loading the tokenizer from a different location than the model. This is useful when you are running in an air-gapped environment and the tokenizer is stored separately from the model.

Albertoimpl avatar Aug 04 '23 15:08 Albertoimpl

As a note, I am new to the project, so feel free to point me in the right direction.

Albertoimpl avatar Aug 04 '23 15:08 Albertoimpl

Thanks for the PR! I don't quite follow why the tokenizer id would be different from the model id in that case?

lvwerra avatar Aug 08 '23 08:08 lvwerra

Thanks for taking a look, @lvwerra. When using the hub, tokenizer_id can differ from model_id: they are two entities that can be optimised and iterated on separately. For example, I could be using https://huggingface.co/meta-llama/Llama-2-7b with a tokenizer that is not on the hub. In my particular scenario, I do not use the hub but reference the model by path. I iterate on my model and my tokenizer separately and host each in its own bucket:

perplexity.compute(data=fine_tuned_model_responses, model_id=reference_model_path)['mean_perplexity']

Because of this, I had to copy my tokenizer over to the same reference_model_path, which I found inconvenient.

I thought it would be nice for people also to have the ability to do:

perplexity.compute(data=fine_tuned_model_responses, model_id=reference_model_path, tokenizer_id=reference_tokenizer_path)['mean_perplexity']
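
A minimal sketch of how such an optional tokenizer_id could be resolved, assuming it falls back to model_id when omitted so existing calls keep working. The helper name load_model_and_tokenizer and its loader arguments are hypothetical and are not part of the evaluate library; inside the metric, the loaders would presumably be AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained from transformers.

```python
from typing import Any, Callable, Optional, Tuple


def load_model_and_tokenizer(
    model_id: str,
    tokenizer_id: Optional[str] = None,
    *,
    model_loader: Callable[[str], Any],
    tokenizer_loader: Callable[[str], Any],
) -> Tuple[Any, Any]:
    """Hypothetical helper: load a model and a tokenizer from separate locations.

    If tokenizer_id is None, the tokenizer is loaded from model_id, which
    matches the behaviour of a call that passes only model_id.
    """
    # Fall back to the model location when no tokenizer location is given.
    tokenizer_source = model_id if tokenizer_id is None else tokenizer_id
    model = model_loader(model_id)
    tokenizer = tokenizer_loader(tokenizer_source)
    return model, tokenizer


if __name__ == "__main__":
    # Stand-in loaders that only record the path they were given, so the
    # sketch runs without downloading anything.
    model, tokenizer = load_model_and_tokenizer(
        "bucket-a/model",
        tokenizer_id="bucket-b/tokenizer",
        model_loader=lambda path: ("model", path),
        tokenizer_loader=lambda path: ("tokenizer", path),
    )
    print(model, tokenizer)
```

Defaulting tokenizer_id to None keeps the change backwards compatible: callers who store both artifacts in one place do not need to change anything.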

Hopefully, this makes sense. If not, happy to learn! Thanks!

Albertoimpl avatar Aug 09 '23 07:08 Albertoimpl