
Can the StarCoder model be used for encoding, and if so, how?

Open Symbolk opened this issue 1 year ago • 5 comments

Besides the well-known ChatGPT, more and more startups and researchers are noticing the great value and potential of the OpenAI embedding API (https://platform.openai.com/docs/guides/embeddings). It enables many domain-specific adaptations and applications, such as LlamaIndex, soft prompting, and retrieval-augmented generation.

Therefore, I wonder whether StarCoder can be used for encoding. If the answer is yes, how should we make it usable: by modifying the network layers, or solely the inference code?

I know there is StarEncoder (~125M parameters). Is it already suitable for encoding?

Symbolk avatar May 30 '23 02:05 Symbolk

I believe as it's a decoder-only architecture, you can't encode with it?

But correct me if I'm wrong.

xpl avatar May 30 '23 18:05 xpl

Any luck with this?

WrViajero avatar Jun 27 '23 03:06 WrViajero

@dpfried @lvwerra can you please help?

ramsey-coding avatar Sep 23 '23 00:09 ramsey-coding

You can always get the hidden states of the model and use those as embeddings. We have never benchmarked how good they are for the decoder but @joaomonteirof has benchmarked the encoder models a bit!
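A minimal sketch of that idea, assuming the Hugging Face `transformers` and `torch` packages. The mean-pooling step is an assumption of this sketch, not something benchmarked in this thread; last-token pooling is another common choice for causal models. The helper names `mean_pool` and `embed` are hypothetical.

```python
import torch


def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sequence, ignoring padding positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed(texts, model, tokenizer):
    """Return one vector per input string from the model's top-layer hidden states."""
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    batch = {k: v.to(model.device) for k, v in batch.items()}
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    # out.hidden_states is a tuple with one tensor per layer; [-1] is the top layer
    return mean_pool(out.hidden_states[-1], batch["attention_mask"])
```

To try it, one would load a checkpoint with `AutoTokenizer.from_pretrained(...)` and `AutoModelForCausalLM.from_pretrained(...)`, set `tokenizer.pad_token = tokenizer.eos_token` if the tokenizer has no pad token, and call `embed(["def add(a, b): return a + b"], model, tokenizer)`. The quality of the resulting vectors for retrieval is untested here.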

lvwerra avatar Sep 26 '23 09:09 lvwerra

I think StarCoder's top layer hidden states could work well. For StarEncoder, we did some code-to-code retrieval evaluations after pre-training and results were quite promising. Relevant discussions on how to get chunk-level embeddings here:

  • https://github.com/bigcode-project/bigcode-encoder/issues/14
  • https://huggingface.co/bigcode/starencoder/discussions/3

joaomonteirof avatar Sep 26 '23 12:09 joaomonteirof