
Llama cpp

CircArgs opened this pull request

Description:

This is a rough attempt at adding support for llama.cpp. It adds a separate model server and breaks out a couple of shared interfaces between the llama.cpp server and the HF server. It is far from perfect, but with some suggestions I can hopefully get it into a proper state if desired.

What the changes amount to:

An example command to run with llama.cpp looks like the following: `lmql serve-model AlekseyKorshuk/vicuna-7b --llama.cpp --model_path=../ggml-vicuna-7b-1.0-uncensored-q4_0.bin --n_threads=8`

The current problem I have not gotten around to is that lmql.runtime.tokenizer requires some tokenizer to be loaded alongside the serve, and I'm not sure of the best way to build a tokenizer compatible with LMQLTokenizer from just the ggml file. So, for now, I have bootstrapped the model argument to require an HF tokenizer that can be paired with the llama.cpp ggml.
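For concreteness, here is a minimal sketch of the kind of pairing described above, using the llama-cpp-python bindings and a Hugging Face tokenizer as the stand-in. This is not the PR's code; the model name, file path, and thread count are simply reused from the example command.

```python
# Illustrative sketch only, not the PR's implementation: pair a llama.cpp ggml
# model with a separate HF tokenizer, since the ggml file alone does not give
# us a tokenizer that LMQLTokenizer can wrap directly.
from llama_cpp import Llama               # llama-cpp-python bindings
from transformers import AutoTokenizer

# HF tokenizer used as the "proxy" tokenizer for the llama.cpp model.
tokenizer = AutoTokenizer.from_pretrained("AlekseyKorshuk/vicuna-7b")

# llama.cpp model loaded from the local ggml weights.
llm = Llama(model_path="../ggml-vicuna-7b-1.0-uncensored-q4_0.bin", n_threads=8)

prompt = "Q: What does LMQL stand for? A:"
input_ids = tokenizer(prompt)["input_ids"]   # tokenization via the HF proxy tokenizer
completion = llm(prompt, max_tokens=32)      # generation via llama.cpp
print(len(input_ids), completion["choices"][0]["text"])
```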

CircArgs · Apr 14 '23

Thank you, this is so awesome! I have been meaning to look into llama support myself, but many other things are important right now, too. I will look into this more closely tomorrow, hopefully. I think we can find a good fix for the tokenizer, but the HF workaround is also good to get things running for now.

Thank you so much.

lbeurerkellner · Apr 15 '23

Just a heads up, I just bumped main to transformers==4.28.1, so this should bring LlamaTokenizer, etc. But given that you were already using it, I assume you are already running on a later version of transformers. Not providing clean_up_tokenization_spaces seems to cause a bug for now, at least with GPT models. I am not sure whether True or False is the right choice here; I will have to investigate more.
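As a quick sanity check (illustrative only, not part of the PR), one can verify that the installed transformers release actually ships the Llama tokenizer classes introduced around 4.28.0:

```python
# Verify that the installed transformers release provides LlamaTokenizer,
# which is what the bump to transformers==4.28.1 is meant to guarantee.
import transformers
from transformers import LlamaTokenizer  # raises ImportError on older releases

print(transformers.__version__)  # expected: 4.28.1 or newer
```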

lbeurerkellner · Apr 18 '23

@lbeurerkellner I did bump it to transformers==4.28.1 and set clean_up_tokenization_spaces to True, which seemed to fix it; this is where the error originates. It looks for the argument or falls back to the config value, both of which are False by default, I think.
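To illustrate the effect of the flag (a sketch only, not the exact code path in LMQL), it can be passed explicitly when decoding instead of relying on the argument/config lookup described above:

```python
# Illustrative: pass clean_up_tokenization_spaces explicitly on decode and
# observe how it changes whitespace handling around punctuation.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
ids = tok("Hello , world !")["input_ids"]

print(tok.decode(ids, clean_up_tokenization_spaces=True))   # "Hello, world!"
print(tok.decode(ids, clean_up_tokenization_spaces=False))  # "Hello , world !"
```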

For this PR though, does this matter? It looked like with https://github.com/eth-sri/lmql/commit/3736fcbebe0e157b09641854479dd9abebc03052 you were moving towards a different design where this "proxy" tokenizer would not be needed anymore.

CircArgs · Apr 18 '23

@lbeurerkellner Since ggml is making great progress every week and is already providing stellar performance on Apple Silicon and GPUs: Is there any update on finalizing and merging this PR?

gottlike · Jun 15 '23

Yes. We have recently simplified the backend infrastructure greatly, which makes an integration of llama.cpp much easier. Before that, however, we first need to fix the integration of a Llama tokenizer: the HF one seems to behave in a non-standard way and needs special handling, see e.g. #95.
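As a hedged illustration of the kind of non-standard behavior meant here (the exact problem tracked in #95 may differ), SentencePiece-based Llama tokenizers mark word boundaries with a leading "▁", so decoding tokens one at a time does not round-trip the same way as decoding the full sequence. The checkpoint name below is just an illustrative choice.

```python
# Illustrative only; the precise issue referenced above (#95) may differ.
# SentencePiece-based Llama tokenizers prefix tokens with "▁" for spaces,
# so per-token decoding loses whitespace compared to a full-sequence decode.
from transformers import LlamaTokenizer

tok = LlamaTokenizer.from_pretrained("huggyllama/llama-7b")  # illustrative checkpoint

ids = tok("Say hello", add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids))   # e.g. ['▁Say', '▁hello']
print([tok.decode([i]) for i in ids])   # per-token decode: leading spaces dropped
print(tok.decode(ids))                  # full decode: 'Say hello'
```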

In general, though, I am actually quite excited about adding support, and I will definitely try to push this more.

lbeurerkellner · Jun 18 '23

I just merged llama.cpp support in #111. It leverages the new backend infrastructure, which makes it rather easy to support now. This gives us an initial implementation; however, I think work could definitely continue here to improve support with respect to caching and to enable batch_size > 1 on the backend.

Closing this now. Nevertheless, thanks so much for the proposal and the initial implementation. It was a great help in porting this to the updated infrastructure.

lbeurerkellner · Jul 13 '23