RBERT

Save tokenizer as part of model

Open · jonthegeek opened this issue 5 years ago · 1 comment

The tokenizer for a given model is deterministic (it only depends on the vocab file + whether it's cased). Producing the tokenizer takes 100x as long as loading a pre-processed tokenizer (about 4 s vs 40 ms for bert_base_uncased).
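The kind of caching that closes that gap is just serializing the built tokenizer object once and reloading it later. A minimal sketch of the idea is below; `build_tokenizer()`, the vocab path, and `tokenizer.rds` are placeholder names for illustration, not RBERT's actual interface — only the base-R calls (`saveRDS()`, `readRDS()`, etc.) are real.

```r
# Minimal sketch of the caching idea, assuming a hypothetical build_tokenizer().
build_tokenizer <- function(vocab_file, do_lower_case = TRUE) {
  # Stand-in for the slow step (~4 s for bert_base_uncased):
  # read the vocab and build the wordpiece lookup structures.
  list(vocab = readLines(vocab_file), do_lower_case = do_lower_case)
}

vocab_file <- "bert_base_uncased/vocab.txt"                   # hypothetical path
cache_file <- file.path(dirname(vocab_file), "tokenizer.rds")

system.time(tok <- build_tokenizer(vocab_file))  # slow: rebuilds from the vocab
saveRDS(tok, cache_file)                         # cache the finished object
system.time(tok <- readRDS(cache_file))          # fast: ~100x quicker to load
```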

Save the tokenizer as part of the download process. If a model has a vocab but no saved tokenizer, build and save the tokenizer once, then reuse it going forward (for backward compatibility with models that have already been downloaded).
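A rough sketch of that backward-compatible fallback, reusing the hypothetical `build_tokenizer()` from the sketch above; `get_tokenizer()` and the file names are likewise illustrative, not RBERT's real API.

```r
# Use the cached tokenizer if it exists; otherwise build it from the vocab
# once and cache it so future calls take the fast path.
get_tokenizer <- function(model_dir, do_lower_case = TRUE) {
  cache_file <- file.path(model_dir, "tokenizer.rds")
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))                      # fast path: already cached
  }
  vocab_file <- file.path(model_dir, "vocab.txt")
  tok <- build_tokenizer(vocab_file, do_lower_case)  # slow path, runs only once
  saveRDS(tok, cache_file)                           # subsequent calls hit cache
  tok
}
```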

jonthegeek · Jan 19 '20, 21:01

Note: I tested preprocessing the config JSON vs. saving it as-is; preprocessing only saves microseconds, so it probably isn't worth messing with. It wouldn't HURT, though, so I may apply the same fix to the config when I do the tokenizer.

jonthegeek · Jan 19 '20, 21:01