Convert vocabulary types and load model concurrently
Converting vocabulary types and JIT-compiling Numba functions can account for a substantial share of the compile time for very simple regular expressions. In particular, type conversion must happen every time the code runs, adding a few seconds to each session. Here we perform these operations while the model weights are being downloaded and loaded onto the GPU, thus removing some of the overhead associated with index compilation.
Closes #768.
A few API tweaks are necessary to implement this properly. First, the code that transforms a regex into an index should be decomposed into:
- A function that takes a regex as an argument and returns a byte-level deterministic FSM;
- A function that adapts the vocabulary and then converts the vocabulary types;
- A function that takes the converted vocabulary and the tokenizer and returns an index.
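The decomposition above can be sketched as follows. This is an illustrative outline with stubbed bodies and hypothetical function names, not the actual implementation; the point is that each step becomes independently callable (and therefore schedulable or cacheable on its own):

```python
def regex_to_byte_fsm(regex):
    """Build a byte-level deterministic FSM from a regex (stubbed here)."""
    return ("fsm", regex)


def convert_vocabulary(tokenizer_vocab):
    """Adapt the vocabulary and convert its types (stubbed here)."""
    return tuple(sorted(tokenizer_vocab))


def build_index(fsm, converted_vocab):
    """Combine the FSM and the converted vocabulary into an index (stubbed)."""
    return {"fsm": fsm, "vocab": converted_vocab}


# The three steps compose into the original regex -> index transformation.
index = build_index(regex_to_byte_fsm(r"\d+"), convert_vocabulary(["1", "0"]))
```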
We then initialize model wrappers (e.g. outlines.models.vLLM) with the model instance and the converted vocabulary. Initializing functions (e.g. outlines.models.vllm) accept a model name or a model class.
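A minimal sketch of the concurrent initialization, using stand-in helpers rather than the real outlines internals: the vocabulary conversion runs in a background thread while the model weights load, and the wrapper is built from both results.

```python
from concurrent.futures import ThreadPoolExecutor


def load_model(name):
    """Stand-in for downloading the weights and loading them on GPU."""
    return f"model:{name}"


def convert_vocabulary(vocab):
    """Stand-in for the expensive Numba-related type conversion."""
    return {token: i for i, token in enumerate(vocab)}


def init_model(name, vocab):
    """Run model loading and vocabulary conversion concurrently."""
    with ThreadPoolExecutor() as pool:
        model_future = pool.submit(load_model, name)
        vocab_future = pool.submit(convert_vocabulary, vocab)
        # Both futures resolve before the wrapper is returned, so the
        # conversion overhead overlaps with the model load.
        return model_future.result(), vocab_future.result()


model, converted = init_model("my-model", ["a", "b"])
```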