Convert vocabulary types and load model concurrently
Converting vocabulary types and JIT-compiling Numba functions can account for a substantial share of the compile time for very simple regular expressions. In particular, type conversion must happen every time the code runs, adding a few seconds to each session. Here we perform these operations while the model weights are being downloaded and loaded onto the GPU, thus removing some of the overhead associated with index compilation.
Closes #768.
A few API tweaks are necessary to implement this properly. First, the code that transforms a regex into an index should be decomposed into:
- A function that takes a regex as an argument and returns a byte-level deterministic FSM;
- A function that adapts the vocabulary and then converts the vocabulary types;
- A function that takes the converted vocabulary and the tokenizer and returns an index.
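The decomposition above can be sketched as follows. This is an illustrative outline with stubbed bodies and hypothetical function names, not the actual implementation; the point is that each step becomes independently callable (and therefore schedulable or cacheable on its own):

```python
def regex_to_byte_fsm(regex):
    """Build a byte-level deterministic FSM from a regex (stubbed here)."""
    return ("fsm", regex)


def convert_vocabulary(tokenizer_vocab):
    """Adapt the vocabulary and convert its types (stubbed here)."""
    return tuple(sorted(tokenizer_vocab))


def build_index(fsm, converted_vocab):
    """Combine the FSM and the converted vocabulary into an index (stubbed)."""
    return {"fsm": fsm, "vocab": converted_vocab}


# The three steps compose into the original regex -> index transformation.
index = build_index(regex_to_byte_fsm(r"\d+"), convert_vocabulary(["1", "0"]))
```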
We then initialize model wrappers (e.g. outlines.models.vLLM) with the model instance and the converted vocabulary. Initializing functions (e.g. outlines.models.vllm) accept a model name or a model class.
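A minimal sketch of the concurrent initialization, using stand-in helpers rather than the real outlines internals: the vocabulary conversion runs in a background thread while the model weights load, and the wrapper is built from both results.

```python
from concurrent.futures import ThreadPoolExecutor


def load_model(name):
    """Stand-in for downloading the weights and loading them on GPU."""
    return f"model:{name}"


def convert_vocabulary(vocab):
    """Stand-in for the expensive Numba-related type conversion."""
    return {token: i for i, token in enumerate(vocab)}


def init_model(name, vocab):
    """Run model loading and vocabulary conversion concurrently."""
    with ThreadPoolExecutor() as pool:
        model_future = pool.submit(load_model, name)
        vocab_future = pool.submit(convert_vocabulary, vocab)
        # Both futures resolve before the wrapper is returned, so the
        # conversion overhead overlaps with the model load.
        return model_future.result(), vocab_future.result()


model, converted = init_model("my-model", ["a", "b"])
```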