                        Save tokenizer as part of model
The tokenizer for a given model is deterministic (it depends only on the vocab file + whether the model is cased). Producing the tokenizer from scratch takes about 100x as long as loading a pre-processed tokenizer (roughly 4 s vs 40 ms for bert_base_uncased).
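A rough sketch of how that timing comparison can be reproduced. The FullTokenizer() constructor and the file paths are assumptions here for illustration; the cached copy is just an RDS file written with saveRDS().

```r
library(RBERT)

vocab_file <- "bert_base_uncased/vocab.txt"      # path assumed for illustration
cache_file <- "bert_base_uncased/tokenizer.rds"  # hypothetical cache location

# Slow path: build the tokenizer from the vocab file (~4 s for bert_base_uncased).
system.time({
  tokenizer <- FullTokenizer(vocab_file = vocab_file, do_lower_case = TRUE)
})

# Save it once, then time the fast path (~40 ms).
saveRDS(tokenizer, cache_file)
system.time({
  tokenizer <- readRDS(cache_file)
})
```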
Save the tokenizer as part of the download process. If a model has a vocab but not a saved tokenizer, build and save the tokenizer once, then use it going forward (for backward compatibility with models that have already been downloaded); see the sketch below.
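A minimal sketch of the backward-compatibility path, assuming a FullTokenizer() constructor and a "tokenizer.rds" file saved alongside the checkpoint (both names are assumptions, not the final API): load the cached tokenizer if it exists, otherwise build it from the vocab file and cache it for next time.

```r
library(RBERT)

get_or_create_tokenizer <- function(ckpt_dir, do_lower_case = TRUE) {
  # Hypothetical cache file written by the download step.
  tokenizer_file <- file.path(ckpt_dir, "tokenizer.rds")
  if (file.exists(tokenizer_file)) {
    # Fast path: pre-processed tokenizer already saved with the model.
    return(readRDS(tokenizer_file))
  }
  # Slow path: model was downloaded before this change, so build the
  # tokenizer from the vocab file and save it for future loads.
  vocab_file <- file.path(ckpt_dir, "vocab.txt")
  tokenizer <- FullTokenizer(vocab_file = vocab_file,
                             do_lower_case = do_lower_case)
  saveRDS(tokenizer, tokenizer_file)
  tokenizer
}
```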
Note: I tested preprocessing the config JSON vs. saving it as-is; preprocessing only saves microseconds, so it probably isn't worth messing with. It wouldn't HURT, though, so I may apply the same fix to it when I do the tokenizer.