lm-evaluation-harness
Introduce perplexity per token in loglikelihood_rolling
Hi,
Thanks for your excellent work in driving this project forward.
I am interested in adding the task described in the YaRN paper for computing perplexity over long contexts. However, I find that when using `loglikelihood_rolling`, the only accepted metrics are:
- word_perplexity
- byte_perplexity
- bits_per_byte
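For reference, these three metrics differ only in the normalizer applied to the summed loglikelihoods. A minimal sketch of the aggregation (the function name and signature here are illustrative, not the harness's actual API):

```python
import math

def rolling_perplexity_metrics(loglikelihoods, num_words, num_bytes):
    """Aggregate per-document loglikelihoods (natural log) into the
    three rolling metrics. Illustrative sketch, not the harness API."""
    total_ll = sum(loglikelihoods)
    # Perplexity normalized by word count
    word_perplexity = math.exp(-total_ll / num_words)
    # Perplexity normalized by byte count
    byte_perplexity = math.exp(-total_ll / num_bytes)
    # Same quantity in bits (log base 2) per byte
    bits_per_byte = -total_ll / (num_bytes * math.log(2))
    return word_perplexity, byte_perplexity, bits_per_byte
```

A token-normalized perplexity would follow the same pattern with the token count as the denominator.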
I have not yet managed to add a `sliding_window` variable, but I believe I have managed to introduce a token-normalized perplexity in place of the previous implementations. I hope it helps!
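The sliding-window evaluation mentioned above (as used in long-context perplexity tests) would cap the context length and advance by a stride, counting each token's loglikelihood only once. A hypothetical sketch, assuming a `token_logprob_fn` that returns per-token loglikelihoods for a window (this helper is invented for illustration and is not part of the harness):

```python
import math

def sliding_window_token_ppl(token_logprob_fn, tokens, max_len, stride):
    """Token-normalized perplexity with a fixed-size sliding window.
    Assumes stride <= max_len so no token is skipped. Hypothetical
    sketch, not the harness's implementation."""
    total_nll, counted, prev_end = 0.0, 0, 0
    for start in range(0, len(tokens), stride):
        end = min(start + max_len, len(tokens))
        logprobs = token_logprob_fn(tokens[start:end])
        new = end - prev_end  # tokens not yet counted toward the total
        total_nll -= sum(logprobs[-new:])
        counted += new
        prev_end = end
        if end == len(tokens):
            break
    return math.exp(total_nll / counted)
```

With `stride == max_len` this degenerates to non-overlapping chunks; smaller strides give each token more left context at the cost of more forward passes.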
To test that it works, I followed these steps:
- Download the repo.
- Create a virtual environment: `python -m venv my_env`.
- Activate it: `source my_env/bin/activate`.
- Install the repo in editable mode: `pip install -e .`
- Execute:
model=#Insert the model you want here
few_shot=0
tensor_parallelism=False
num_samples=2
dataset=proof-pile
output_dir=results/$(basename ${model})/${few_shot}-shot/results:$(basename ${model}):${dataset}:-shot.json
CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 -m lm_eval --model hf \
--model_args pretrained=$model,trust_remote_code=True \
--tasks ${dataset} \
--num_fewshot $few_shot \
--batch_size 1 \
--output_path $output_dir \
--log_samples \
--seed 1234 \
--limit $num_samples
(The results appear to align with the Hugging Face tests.)
Do you plan to merge this at some point, or does anything speak against it? From my understanding, most data-mixing papers report perplexity per token, so I am actually surprised that this is not currently integrated into the eval harness.
Hi, I apologize for leaving this PR unfinished; I had to take on another project and it has been hard to find the time. I still have to fix some minor issues with the unit tests, which I can try to address next month. That said, it is not a very elegant solution: access to the tokens is limited and this was a patch, so the code may need to change for future features, and I am not sure whether this PR will ultimately be accepted.