Can StarCoder be used for encoding, and if so, how?
Besides the well-known ChatGPT, more and more startups and researchers are noting the great value and potential of the OpenAI embeddings API (https://platform.openai.com/docs/guides/embeddings). It enables many domain-specific adaptations and applications, such as LlamaIndex, soft prompting, and retrieval-augmented generation.
Therefore, I wonder if StarCoder can be used for encoding. If the answer is yes, how should we make it usable? By modifying the network layers, or solely the inference code?
I know there is StarEncoder (~125M parameters); is it already suitable for encoding?
I believe that, as it's a decoder-only architecture, you can't encode with it?
But correct me if I'm wrong.
Any luck with this?
@dpfried @lvwerra can you please help?
You can always get the hidden states of the model and use those as embeddings. We have never benchmarked how good they are for the decoder, but @joaomonteirof has benchmarked the encoder models a bit!
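For reference, a minimal sketch of what "use the hidden states as embeddings" could look like with `transformers`. This is not a benchmarked pipeline from this thread: the `bigcode/starcoder` checkpoint name is real (and gated on the Hub), but the mean-pooling choice, fp16 loading, and `device_map="auto"` are my assumptions.

```python
# Sketch: pull last-layer hidden states from the StarCoder decoder and
# mean-pool them into a fixed-size embedding. Assumes access to the gated
# bigcode/starcoder checkpoint and the `accelerate` package for device_map.
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "bigcode/starcoder"  # gated model; accept the license on the Hub first
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map="auto")
model.eval()

@torch.no_grad()
def embed(code: str) -> torch.Tensor:
    inputs = tokenizer(code, return_tensors="pt", truncation=True).to(model.device)
    outputs = model(**inputs)                      # last_hidden_state: [1, seq_len, hidden]
    mask = inputs["attention_mask"].unsqueeze(-1)  # ignore padded positions (none for a single input)
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    return (summed / mask.sum(dim=1)).squeeze(0)   # mean-pooled embedding of size [hidden]

emb = embed("def add(a, b):\n    return a + b")
print(emb.shape)  # hidden size of the decoder, e.g. 6144 for StarCoder
```

Last-token pooling instead of mean pooling is another reasonable choice for a decoder-only model; which works better would need to be measured.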
I think StarCoder's top-layer hidden states could work well. For StarEncoder, we did some code-to-code retrieval evaluations after pre-training, and the results were quite promising. Relevant discussions on how to get chunk-level embeddings are here (a rough sketch follows the links):
- https://github.com/bigcode-project/bigcode-encoder/issues/14
- https://huggingface.co/bigcode/starencoder/discussions/3
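As a starting point, here is a rough sketch of chunk-level embeddings from StarEncoder. The `bigcode/starencoder` checkpoint name is real; the mean-pooling strategy, the 1024-token cap, and the toy similarity check are my assumptions, and the linked discussions describe the approach the authors actually evaluated (e.g. a [CLS]-style summary token).

```python
# Sketch: embed code chunks with StarEncoder by mean-pooling its last hidden state.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

checkpoint = "bigcode/starencoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

@torch.no_grad()
def embed_chunk(code: str) -> torch.Tensor:
    # One chunk at a time, so no padding-token configuration is needed.
    inputs = tokenizer(code, truncation=True, max_length=1024, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state   # [1, seq_len, hidden]
    return hidden.mean(dim=1).squeeze(0)         # mean over tokens -> [hidden]

# Toy code-to-code retrieval check: cosine similarity between two snippets.
a = embed_chunk("def add(a, b):\n    return a + b")
b = embed_chunk("int add(int a, int b) { return a + b; }")
print(F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).item())
```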