code-llama-for-vscode
Adapt to use Hugging Face models (includes streaming)
Hi
I thought about how to implement the streaming functionality and saw that the only way was to rewrite the generation functions in `codellama`, which seemed a bit messy. Simultaneously, Hugging Face released the models in their format, so I thought the easiest thing to do would be to use them.
Advantages:
- Makes the models more accessible to anyone (no need to download the checkpoints manually from Meta).
- It is easy to load quantized models (I added a `load_in_4bit` flag).
- Your Flask server is simplified because the parallelization is handled by the `transformers` library, so you only have one instance of the server running (i.e., no need to mess with `torch.distributed`).
- Streaming is pretty straightforward (see the sketch after this list).
- We could easily adapt it to use `text-inference-server` in the backend, which is much faster than the regular generation.
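For concreteness, here is a minimal sketch of how the pieces fit together in my head: a single Flask process that loads the model in 4-bit and streams tokens back through `TextIteratorStreamer`. The model id, endpoint name, port, and generation settings here are illustrative assumptions rather than the exact code in this PR:

```python
# Minimal sketch (assumptions noted inline): one Flask process, 4-bit load,
# token streaming via transformers' TextIteratorStreamer.
from threading import Thread

import torch
from flask import Flask, Response, request
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "codellama/CodeLlama-13b-Instruct-hf"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    load_in_4bit=True,        # bitsandbytes 4-bit quantization, exposed via the new flag
    device_map="auto",
    torch_dtype=torch.float16,
)

app = Flask(__name__)


@app.route("/generate", methods=["POST"])  # endpoint name is illustrative
def generate():
    prompt = request.json["prompt"]
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # TextIteratorStreamer yields decoded text chunks as soon as generate() produces them,
    # so generation runs in a background thread while we iterate over the streamer.
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 512},
    ).start()

    # Forward each chunk to the client as it arrives.
    return Response((chunk for chunk in streamer), mimetype="text/plain")


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```

A client (e.g. the VS Code extension) could then consume the response chunk by chunk, for example with `requests.post(..., stream=True)`, which is all that is needed for live streaming.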
Disadvantages:
- For some reason, I get worse results from the Hugging Face version of the 13b instruct model (even without quantization). For example, if I ask it `Tell me a joke in C`, I get responses similar to this:
```
A C Programmer's Buggy Journey

Sure! Here's a joke in C: Why did the C programmer go to the doctor? Because he was feeling a little "buggy"! I hope you found that joke in C to be "buggy" and "funny"!
```
and sometimes it spits out endless `\n` tokens instead of stopping when it should. When I run your code using the Meta checkpoints, I get something like:
"Chicken Joke: A Play on Words" Sure, here's a joke in C:
#include <stdio.h> int main() { printf("Why did the chicken cross the playground?\n"); printf("To get to the other slide!\n"); return 0; }This joke is a play on words, as "slide" can refer to both a toy slide and a software slide.
I mean, the jokes are terrible, but at least it writes the joke in C as instructed.
Anyway, I thought I would create this Pull Request so you could play around with it. I'd be interested to know whether you think this is a good direction to go in.