Add the ability to serialize custom Python components

Open n1t0 opened this issue 4 years ago • 2 comments

It is currently impossible to serialize custom Python components, so if a Tokenizer embeds some of them, the user can't save it.

I haven't dug into this, so I don't know exactly what the constraints or requirements would be, but it is something we should explore at some point.

n1t0 avatar Jan 06 '21 15:01 n1t0

This would be a useful feature. We could probably serialize Python objects using pickle or dill. However, the serialization code is in Rust. Is it possible to serialize the custom Python components with pickle?

ibraheem-moosa avatar Feb 13 '22 04:02 ibraheem-moosa

The end result has to be saved as JSON, so I don't think it's doable. Also, pickle is highly unsafe and not portable (despite being widely used).

Currently, the workaround is to replace the custom component with a serializable one before saving, then restore the custom component after loading:
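To illustrate both points, here is a minimal pure-Python sketch. It is only an analogy: the real serialization happens in Rust, not through Python's `json` module, and `CustomPreTokenizer`, `can_dump_as_json`, and `RunsCodeOnLoad` are hypothetical names invented for this example. It shows that a JSON encoder has no representation for an arbitrary Python object, and that unpickling can call any callable the payload names.

```python
import json
import pickle


class CustomPreTokenizer:
    """Hypothetical stand-in for a user-defined Python component."""

    def pre_tokenize(self, pretok):
        pass


def can_dump_as_json(obj) -> bool:
    """Return True if the standard JSON encoder can represent obj."""
    try:
        json.dumps(obj)
    except TypeError:
        return False
    return True


class RunsCodeOnLoad:
    """Shows why unpickling untrusted data is unsafe."""

    def __reduce__(self):
        # pickle calls len("abc") while loading; a malicious payload
        # could name any importable callable here instead of len.
        return (len, ("abc",))


# Loading does not give back a RunsCodeOnLoad instance: it gives back
# whatever the callable returned.
loaded = pickle.loads(pickle.dumps(RunsCodeOnLoad()))
```

Here `can_dump_as_json({"type": "Whitespace"})` succeeds because plain data maps onto JSON, while `can_dump_as_json(CustomPreTokenizer())` does not, since behavior defined in Python code has no JSON form.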

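from tokenizers import Tokenizer, pre_tokenizers

tokenizer.pre_tokenizer = Custom()  # custom Python component in use
# Before saving, swap in a serializable built-in component
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.save("tok.json")

# Load later, then put the custom component back
tokenizer = Tokenizer.from_file("tok.json")
tokenizer.pre_tokenizer = Custom()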

It is a bit inconvenient but at least it's safe and portable.
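The swap-and-restore workaround above could be wrapped in two small helpers so the custom component is always put back, even if saving fails. This is only a sketch: `save_with_placeholder` and `load_with_custom` are hypothetical names, not part of the tokenizers API, and they assume nothing more than an object with a `pre_tokenizer` attribute and a `save(path)` method.

```python
def save_with_placeholder(tokenizer, path, placeholder, custom):
    """Save with a serializable placeholder, then restore the custom component.

    Assumes tokenizer has a settable pre_tokenizer attribute and save(path).
    """
    tokenizer.pre_tokenizer = placeholder
    try:
        tokenizer.save(path)
    finally:
        # Restore the custom component even if save() raised.
        tokenizer.pre_tokenizer = custom


def load_with_custom(load, path, custom):
    """Load with the given loader callable, then reattach the custom component."""
    tokenizer = load(path)
    tokenizer.pre_tokenizer = custom
    return tokenizer
```

With the snippet above, the calls would look like `save_with_placeholder(tokenizer, "tok.json", pre_tokenizers.Whitespace(), Custom())` and `load_with_custom(Tokenizer.from_file, "tok.json", Custom())`. The saved file still records only the placeholder, so whoever loads it has to supply the custom component themselves.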

Narsil avatar Feb 14 '22 09:02 Narsil

You also can't load it as a PreTrainedTokenizerFast if you have a custom component.

from transformers import PreTrainedTokenizerFast
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)

As a workaround, I do:

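from tokenizers.pre_tokenizers import PreTokenizer
from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
# Reattach the custom component through the private _tokenizer attribute
fast_tokenizer._tokenizer.pre_tokenizer = PreTokenizer.custom(CustomPreTokenizer())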

but overriding the component through the private _tokenizer attribute may cause unpredictable problems.

cceyda avatar Apr 06 '23 20:04 cceyda

Totally understandable.

What kind of pre-tokenizer are you saving? If some building blocks are missing, we could add them to make things more composable, portable, and shareable.

Narsil avatar Apr 07 '23 09:04 Narsil

Is it now possible to save a custom pre-tokenizer?

luvwinnie avatar Aug 27 '23 03:08 luvwinnie

No. A custom component is Python code, so it is not serializable by nature.

Narsil avatar Aug 28 '23 07:08 Narsil