
Add support for encoding pretokenized sequences

Open kabachuha opened this issue 1 year ago • 4 comments

This would be useful for batch processing and for building an embedding cache over numerous documents with dataloaders.

The results for the dict input and the vanilla list of strings are identical; the raw tokenized `transformers` encoding differs slightly, but I think that is just the behavior of that library.

(Screenshots, 2024-06-16: comparison of the three encoding outputs)

kabachuha avatar Jun 16 '24 09:06 kabachuha

Nice! It is odd that it differs - How do you instantiate the tokenizer? Maybe there is a special token that's missing or something similar

Muennighoff avatar Jun 16 '24 14:06 Muennighoff

from gritlm import GritLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GritLM/GritLM-7B")

tokenizer_max_length = 300

# the part with docs
...

tokenizer_output_x = tokenizer(
    documents,
    padding='max_length',
    truncation=True,
    max_length=tokenizer_max_length,
    return_tensors="pt",
)

Nothing unusual, but I do set the max length to enable batch encoding.
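As a rough illustration of why a fixed `max_length` matters here (toy token ids and a hypothetical `pad_token_id = 0`, not the real GritLM tokenizer): padding every sequence to the same length is what turns ragged token lists into a rectangular batch that can be stacked into one tensor.

```python
# Toy sketch of what padding='max_length' + truncation=True does:
# pad/truncate each token list to a fixed length and record which
# positions are real tokens via an attention mask.
def pad_batch(sequences, max_length, pad_token_id=0):
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                        # truncation
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[5, 6, 7], [8, 9]], max_length=4)
print(batch["input_ids"])       # [[5, 6, 7, 0], [8, 9, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 0], [1, 1, 0, 0]]
```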

kabachuha avatar Jun 16 '24 18:06 kabachuha

Can you try without the max length and see if you get the same results? I think the results should be exactly the same.

Muennighoff avatar Jun 16 '24 19:06 Muennighoff

Alright, thank you for noticing! I've found the problem:

I had run a generation-only test earlier in the notebook, and it printed:

Setting pad_token_id to eos_token_id:2 for open-end generation.

Now, without running a generation cell first, the results with the dictionary and the tokenizer output class are exactly the same.
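To make the root cause concrete (toy token ids, hypothetical pad ids, not the actual library internals): once generation has switched the pad token to the EOS id, the padded positions of every subsequent batch contain different token ids, which can shift the resulting embeddings if those positions are not masked identically.

```python
# Same ragged sequence padded with two different pad ids: the filled
# positions differ, so the batched input_ids are no longer identical.
def pad(seq, max_length, pad_token_id):
    return seq + [pad_token_id] * (max_length - len(seq))

tokens = [8, 9]
print(pad(tokens, 4, pad_token_id=0))  # [8, 9, 0, 0]
print(pad(tokens, 4, pad_token_id=2))  # [8, 9, 2, 2]  (eos reused as pad)
```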

(Screenshots: matching outputs for the dictionary and tokenizer-output inputs)

kabachuha avatar Jun 16 '24 20:06 kabachuha