Tokenizer used for context length checking

Open LckyLke opened this issue 7 months ago • 2 comments

https://github.com/dice-group/dice-embeddings/blame/674e9f5e521e304691ef063f9f79b23e0a5f8ef2/retrieval_aug_predictors/models/RALP.py#L59C2-L59C64

Why do we use the gpt-3.5-turbo tokenizer here? Is this the one used by the LM we are currently using? Also, shouldn't this be variable depending on the model used? :) @alkidbaci
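For illustration, a minimal sketch of how tokenizer selection could be made model-dependent with tiktoken. The helper name is hypothetical, and since tiktoken only maps OpenAI model names, unknown models are assumed to fall back to a generic encoding:

```python
import tiktoken

def tokenizer_for(model_name: str) -> tiktoken.Encoding:
    """Hypothetical helper: pick a tokenizer per model instead of hardcoding one."""
    try:
        # tiktoken knows OpenAI model names (e.g. "gpt-3.5-turbo")
        return tiktoken.encoding_for_model(model_name)
    except KeyError:
        # Unknown (e.g. non-OpenAI) model: fall back to a generic encoding
        # that still gives a rough token count.
        return tiktoken.get_encoding("cl100k_base")
```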

LckyLke · Apr 28 '25 14:04

I was actually looking for a convenient way to measure the tokens required, and since the tiktoken library was already installed alongside another dependency, I went with it. I just needed a rough estimate of the token count, and the "gpt-3.5-turbo" tokenizer provides exactly that. I cap it at 25,000 as the highest possible value just so it does not throw an error. Although our current LLM allows for more, I think tokens beyond the 25,000 limit will not contribute that much anyway.
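A minimal sketch of the check described above, assuming tiktoken; the constant and function names here are illustrative, not the actual ones in RALP.py:

```python
import tiktoken

MAX_CONTEXT_TOKENS = 25_000  # rough cap so the request never errors out

def truncate_to_token_limit(text: str, limit: int = MAX_CONTEXT_TOKENS) -> str:
    """Illustrative helper: cut `text` down to at most `limit` tokens."""
    # The gpt-3.5-turbo tokenizer is only a proxy here: other models
    # tokenize differently, but it is good enough for a rough estimate.
    enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = enc.encode(text)
    if len(tokens) <= limit:
        return text
    return enc.decode(tokens[:limit])
```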

So to sum up, this is just a temporary solution so that the program does not crash even on larger datasets, and I'm currently exploring ways to sample the triples that go into the context window (so that we don't have to worry about reaching the token limit), roughly along the lines of the sketch below.
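As a rough illustration of that sampling idea (all names are hypothetical, and triples are assumed to be (subject, predicate, object) string tuples), one could greedily fill the context window up to a token budget:

```python
import random
import tiktoken

def sample_triples_to_budget(triples, budget=25_000, model="gpt-3.5-turbo"):
    """Hypothetical sketch: randomly sample triples until the budget is spent."""
    enc = tiktoken.encoding_for_model(model)
    selected, used = [], 0
    # Shuffle so the kept triples form a random sample, then add greedily.
    pool = list(triples)
    random.shuffle(pool)
    for triple in pool:
        line = " ".join(triple)
        cost = len(enc.encode(line)) + 1  # +1 for the newline separator
        if used + cost > budget:
            break
        selected.append(line)
        used += cost
    return "\n".join(selected)
```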

alkidbaci · Apr 28 '25 14:04

Ok, thanks 👍🏼 I guess we should just keep in mind that this temporary solution exists in case another model is ever used :)

LckyLke · Apr 28 '25 15:04