code-llama-for-vscode
Adapt to use Hugging Face models (includes streaming)
Hi
I thought about how to implement the streaming functionality and saw that the only way was to rewrite the generation functions in `codellama`, which seemed a bit messy. Simultaneously, Hugging Face released the models in their format, so I thought the easiest thing to do would be to use them.
Advantages:
- Makes the models more accessible to anyone (no need to download the checkpoints manually from Meta).
- It is easy to load quantized models (I added a `load_in_4bit` flag).
- Your Flask server is simplified because the parallelization is handled by the `transformers` library, so you only have one instance of the server running (i.e., no need to mess with `torch.distributed`).
- Streaming is pretty straightforward (see the sketch after this list).
- We could easily adapt it to use `text-inference-server` in the backend, which is much faster than the regular generation.
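For concreteness, here is a minimal sketch of how the pieces fit together in my head: a single Flask process that loads the model in 4-bit and streams tokens back through `TextIteratorStreamer`. The model id, endpoint name, port, and generation settings here are illustrative assumptions rather than the exact code in this PR:

```python
# Minimal sketch (assumptions noted inline): one Flask process, 4-bit load,
# token streaming via transformers' TextIteratorStreamer.
from threading import Thread

import torch
from flask import Flask, Response, request
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "codellama/CodeLlama-13b-Instruct-hf"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    load_in_4bit=True,        # bitsandbytes 4-bit quantization, exposed via the new flag
    device_map="auto",
    torch_dtype=torch.float16,
)

app = Flask(__name__)


@app.route("/generate", methods=["POST"])  # endpoint name is illustrative
def generate():
    prompt = request.json["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # TextIteratorStreamer yields decoded text chunks as soon as generate() produces them,
    # so generation runs in a background thread while we iterate over the streamer.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 512},
    ).start()

    # Forward each chunk to the client as it arrives.
    return Response((chunk for chunk in streamer), mimetype="text/plain")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

A client (e.g. the VS Code extension) could then consume the response chunk by chunk, for example with `requests.post(..., stream=True)`, which is all that is needed for live streaming.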
Disadvantages:
- For some reason, I get worse results from the Hugging Face version of the 13b instruct model (even without quantization). For example, if I ask it `Tell me a joke in C`, I get responses similar to this:
```
A C Programmer's Buggy Journey

Sure! Here's a joke in C: Why did the C programmer go to the doctor? Because he was feeling a little "buggy"! I hope you found that joke in C to be "buggy" and "funny"!
```
and sometimes it spits out endless `\n` tokens instead of stopping when it should. When I run your code using the Meta checkpoints, I get something like:
"Chicken Joke: A Play on Words" Sure, here's a joke in C:
#include <stdio.h> int main() { printf("Why did the chicken cross the playground?\n"); printf("To get to the other slide!\n"); return 0; }This joke is a play on words, as "slide" can refer to both a toy slide and a software slide.
I mean, the jokes are terrible, but at least it writes the joke in C as instructed.
Anyway, I thought I would create this Pull Request so you could play around with it. I'd be interested to know whether you think this is a good direction to go in.