aici duplicate tokens in tokenizers

duplicate tokens in tokenizers

Open mmoskal opened this issue 1 year ago • 1 comment

For example, the llama tokenizer has both the byte token "<0x20>" (ID 35) and "▁" (space, ID 29871), as well as "<0x21>" (ID 36) and "!" (ID 29991), etc.
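A minimal sketch of how such pairs could be found, assuming the vocabulary is available as a map from token ID to the bytes the token decodes to. The `find_duplicates` helper and the toy vocabulary are hypothetical and not part of aici; only the four IDs quoted above come from this issue.

```python
from collections import defaultdict


def find_duplicates(vocab: dict[int, bytes]) -> dict[bytes, list[int]]:
    """Group token IDs by the bytes they decode to; keep groups with more than one ID."""
    by_bytes: dict[bytes, list[int]] = defaultdict(list)
    for token_id, token_bytes in vocab.items():
        by_bytes[token_bytes].append(token_id)
    return {b: sorted(ids) for b, ids in by_bytes.items() if len(ids) > 1}


# Toy vocabulary (assumption): "▁" is treated as decoding to a single space byte.
# ID 1000 is a made-up entry with no duplicate.
toy_vocab = {
    35: b" ",      # "<0x20>"
    36: b"!",      # "<0x21>"
    29871: b" ",   # "▁"
    29991: b"!",   # "!"
    1000: b"a",    # hypothetical
}

duplicates = find_duplicates(toy_vocab)
```

On the toy vocabulary this groups 35 with 29871 and 36 with 29991, and leaves the unique token out.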

We need to:

pick the canonical form (probably 29871)
keep a side mapping so that whenever 29871 is allowed in the TokenSet, 35 is allowed as well (apply it after "compute_bias()", etc.)
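The two steps above could look roughly like this. It is only a sketch: the token set is modeled as a plain Python `set` of IDs rather than aici's `TokenSet`, and `DUPLICATES` and `expand_with_duplicates` are hypothetical names chosen for illustration.

```python
# Side mapping (assumption): canonical token ID -> other IDs that decode to the same bytes.
DUPLICATES: dict[int, list[int]] = {
    29871: [35],  # "▁" (space) is canonical, "<0x20>" is its duplicate
    29991: [36],  # "!" is canonical, "<0x21>" is its duplicate
}


def expand_with_duplicates(
    allowed: set[int], duplicates: dict[int, list[int]]
) -> set[int]:
    """Return a copy of `allowed` that also allows the duplicates of every allowed canonical ID."""
    expanded = set(allowed)
    for canonical, others in duplicates.items():
        if canonical in allowed:
            expanded.update(others)
    return expanded


# Intended to run after the allowed set has been computed.
# ID 1000 is a made-up token without duplicates.
allowed = {29871, 1000}
result = expand_with_duplicates(allowed, DUPLICATES)
```

The mapping is deliberately one-directional: allowing the canonical ID pulls in its duplicates, while allowing only 35 does not pull in 29871.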

Mar 18 '24 20:03 mmoskal

Mostly done. Still need to call apply_duplicates() in more places, in particular somewhere around return_logit_bias() and possibly after any user-level update to the token set.

Mar 18 '24 21:03 mmoskal
