
add tokenizer logs info

artemorloff opened this issue 10 months ago • 0 comments

It may be useful for researchers to get tokenizer info in the logs. Results can depend on the pad_token, which may be absent (and silently replaced by UNK or EOS); this should be visible for inspection after the model run. Likewise, if the user limits the context with pretrained=model,max_length=N, the value N should be stored for the same reason, since it affects the metrics of the same model. Simple example: Mistral-7B-v0.1 can handle a 32k-token context, but my resources may only allow 10k tokens rather than the full 32k. This key would make results comparable without having to infer the setting from the model name.

artemorloff · Apr 22 '24 13:04