Add the ability to serialize custom Python components
It is currently impossible to serialize custom Python components, so if a Tokenizer embeds some of them, the user can't save it.
I didn't really dig into this, so I don't know exactly what the constraints/requirements would be, but this is something we should explore at some point.
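To illustrate the problem, here is a minimal sketch. The CustomPreTokenizer class and the file name are hypothetical, and the exact error may differ, but saving fails once a custom Python component is attached:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import PreTokenizer

class CustomPreTokenizer:
    # pure-Python splitting logic, invisible to the Rust serializer
    def split(self, i, normalized_string):
        return normalized_string.split(" ", behavior="removed")

    def pre_tokenize(self, pretok):
        pretok.split(self.split)

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())
tokenizer.save("tok.json")  # raises: the custom Python component cannot be serialized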
This is a useful feature. We can probably serialize Python objects using pickle or dill. However, the serialization code is in Rust. Is it possible to serialize the custom Python components with pickle?
The end result has to be saved as JSON, so I don't think it's doable. Also, pickle is highly unsafe and not portable (despite being widely used).
Currently the workaround is to override the component before saving and restore it after loading:

from tokenizers import Tokenizer, pre_tokenizers
from tokenizers.pre_tokenizers import PreTokenizer

# The tokenizer embeds a custom Python pre-tokenizer
tokenizer.pre_tokenizer = PreTokenizer.custom(Custom())
# Swap in a serializable built-in component before saving
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.save("tok.json")

# Load later and re-attach the custom component
tokenizer = Tokenizer.from_file("tok.json")
tokenizer.pre_tokenizer = PreTokenizer.custom(Custom())
It is a bit inconvenient but at least it's safe and portable.
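If this comes up often, the swap-and-restore pattern can be wrapped in small helpers. This is just a sketch, the function names are made up, and it reuses the imports from the snippet above:

def save_with_placeholder(tokenizer, path, custom, placeholder=None):
    # temporarily swap the custom component for a serializable built-in, then restore it
    tokenizer.pre_tokenizer = placeholder or pre_tokenizers.Whitespace()
    tokenizer.save(path)
    tokenizer.pre_tokenizer = PreTokenizer.custom(custom)

def load_with_custom(path, custom):
    # load the JSON file, then re-attach the custom Python component
    tokenizer = Tokenizer.from_file(path)
    tokenizer.pre_tokenizer = PreTokenizer.custom(custom)
    return tokenizer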
You also can't load it as a PreTrainedTokenizerFast if you have a custom component:

from transformers import PreTrainedTokenizerFast

# raises an exception when `tokenizer` embeds a custom Python component
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
As a workaround I do:

from transformers import PreTrainedTokenizerFast
from tokenizers.pre_tokenizers import PreTokenizer

# construct from a tokenizer whose custom component was swapped out,
# then re-attach the custom component on the private attribute
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())

but overriding through the private _tokenizer attribute may cause unpredictable problems.
Totally understandable.
What kind of pre-tokenizer are you saving? If some building blocks are missing, we could add them to make the thing more composable/portable/shareable.
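For instance, if the custom logic can be expressed with existing building blocks, a Sequence of them serializes fine. The specific blocks below are only an illustration:

from tokenizers import pre_tokenizers

# a composition of built-in blocks has a JSON representation, unlike custom Python code
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.WhitespaceSplit(),
    pre_tokenizers.Punctuation(),
    pre_tokenizers.Digits(individual_digits=True),
])
tokenizer.save("tok.json")  # works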
Is it now possible to save a custom pre-tokenizer?
No. A custom component is Python code, so it's not serializable by nature.