Expose unknown tokens in Tokenizers?

Open mandubian opened this issue 5 years ago • 4 comments

I'm training tokenizers, but I sometimes need to manipulate the generated tokens. In the current API, there is no way to access the unknown token (and others), which are hidden in the Models and exposed nowhere, AFAIK. Do you see a workaround for that in the current API? Would it be a good idea to expose these tokens? (Actually, there is no API to access special tokens either.)

mandubian avatar Feb 03 '20 09:02 mandubian

Hi @mandubian. Unfortunately, I'm not sure I entirely understand what you would like to do. Can you be more specific and provide an example of what you are trying to do?

n1t0 avatar Feb 03 '20 17:02 n1t0

I'd like to replace some tokens with the unknown token after tokenizing my data. Imagine adding some kind of randomness, or erasing some information on purpose. But for now, I can't retrieve after the fact the UNK token that was used to train a tokenizer, and I need to know it. Moreover, the default unknown token is not the same in all tokenizers (<unk>, [UNK], ...). Do you see what I mean?
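To illustrate the use case, here is a minimal sketch in plain Python. It assumes the unknown token is already known and passed in explicitly, which is exactly the piece the API does not expose today. The helper name `mask_with_unk` and its parameters are hypothetical and not part of the tokenizers library.

```python
import random
from typing import List, Optional


def mask_with_unk(
    tokens: List[str],
    unk_token: str,
    prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return a copy of `tokens` where each token is independently
    replaced by `unk_token` with probability `prob`.

    Hypothetical helper: `unk_token` must be supplied by the caller,
    because it differs between tokenizers (e.g. "<unk>" vs "[UNK]").
    """
    rng = rng or random.Random()
    # rng.random() is in [0.0, 1.0), so prob=0.0 never masks and prob=1.0 always masks.
    return [unk_token if rng.random() < prob else tok for tok in tokens]


# Usage: pass a seeded generator to make the masking reproducible.
tokens = ["the", "quick", "brown", "fox"]
masked = mask_with_unk(tokens, unk_token="[UNK]", prob=0.5, rng=random.Random(0))
```

Passing the token as an argument keeps the helper independent of any particular tokenizer, at the cost of the caller having to know which unknown token was used at training time.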

mandubian avatar Feb 03 '20 17:02 mandubian

@mandubian @n1t0 We might expose the special tokens vocab on the Tokenizer class so that it's easy for the user to access such information.

The concern is that different tokenizers have different sets of special tokens (some might be shared across all the ones we have), so it has to be implementation-specific or overridable in an implementation-specific way.

What do you think?

mfuntowicz avatar Feb 04 '20 14:02 mfuntowicz

The unknown token might be the only one with a cross-cutting role, right? We could imagine exposing that token, since we know its role in all tokenizers. It would be practical in some cases (like mine). Is it critical? I'm not sure.

For the other tokens, yes, they are implementation-specific, so at best you can just expose all the special tokens and let users work with them. It would be cool to have a notion of "role" for a special token, but it's hard to make that generic, IMHO.
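As a rough sketch of that idea, the unknown token could be the one field with a fixed role, while everything else stays a free-form, implementation-specific mapping. The `SpecialTokens` class and the two example instances below are hypothetical, and the token strings other than the unknown ones are only illustrative conventions, not values taken from any particular tokenizer in the library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class SpecialTokens:
    """Hypothetical container for a tokenizer's special tokens.

    `unk` is the only token with a role shared by every tokenizer;
    `others` maps an implementation-specific role name to its token.
    """

    unk: str
    others: Dict[str, str] = field(default_factory=dict)

    def all_tokens(self) -> List[str]:
        # Unknown token first, then the others in insertion order.
        return [self.unk, *self.others.values()]


# Illustrative instances only; the real sets depend on each implementation.
bracket_style = SpecialTokens(unk="[UNK]", others={"cls": "[CLS]", "sep": "[SEP]"})
angle_style = SpecialTokens(unk="<unk>", others={"bos": "<s>", "eos": "</s>"})

# Code that only needs the unknown token works the same way for both.
unk_tokens = [st.unk for st in (bracket_style, angle_style)]
```

Keeping `others` as a plain dictionary avoids forcing a generic set of roles on every tokenizer, while still giving callers one dependable way to reach the unknown token.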

mandubian avatar Feb 04 '20 17:02 mandubian

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jun 02 '24 01:06 github-actions[bot]