Code similarity CodeT5-large/small

Open lyriccoder opened this issue 1 year ago • 2 comments

I am interested in using the CodeT5 model for code similarity tasks and have a question about its usage in test mode, specifically when comparing only two code snippets. In the CodeXGLUE dataset format, the model expects a list of code snippets and returns the top-n most similar examples for a given query. Is there a way to use the model to check the similarity between two specific snippets instead? I would like to obtain either a similarity score, such as a probability, or a binary output (0 or 1) indicating whether the two snippets are similar or different. For instance, given the following two code snippets:

public void foo() { System.out.println("Hi"); }
protected DecryptedEndPoint newDecryptedEndPoint()
    {
        return new DecryptedEndPoint();
    }

Can your model provide insights into their similarity or equivalence?

lyriccoder avatar Jul 25 '23 11:07 lyriccoder

Hi there, to measure code similarity, I recommend using the CodeT5+ 110M embedding model to extract embeddings for both snippets and then computing their similarity, e.g., cosine similarity.

yuewang-cuhk avatar Aug 03 '23 08:08 yuewang-cuhk
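
The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the maintainers' code: the checkpoint name `Salesforce/codet5p-110m-embedding` and the `trust_remote_code=True` loading pattern are assumptions taken from the public Hugging Face model card rather than from this thread, and the helper names `load_model`, `embed`, and `cosine_similarity` are hypothetical.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def load_model(checkpoint: str = "Salesforce/codet5p-110m-embedding"):
    """Load tokenizer and model. The checkpoint name is an assumption."""
    # Imported lazily so cosine_similarity works without these packages.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
    model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True)
    model.eval()
    return tokenizer, model


def embed(code: str, tokenizer, model) -> list:
    """Return one embedding vector for a snippet (input truncated to 512 tokens)."""
    import torch

    input_ids = tokenizer.encode(
        code, return_tensors="pt", truncation=True, max_length=512
    )
    with torch.no_grad():
        embedding = model(input_ids)[0]  # 1-D tensor for a single input
    return embedding.tolist()


if __name__ == "__main__":
    snippet_a = 'public void foo() { System.out.println("Hi"); }'
    snippet_b = (
        "protected DecryptedEndPoint newDecryptedEndPoint() "
        "{ return new DecryptedEndPoint(); }"
    )
    tokenizer, model = load_model()
    score = cosine_similarity(
        embed(snippet_a, tokenizer, model), embed(snippet_b, tokenizer, model)
    )
    print(f"cosine similarity: {score:.4f}")
```

The resulting score is a similarity measure, not a calibrated probability. To get the binary 0/1 output asked for above, one option is to compare the score against a threshold chosen on labelled pairs from your own data.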

Hi, the CodeT5+ 110M embedding model has an input limit of 512 tokens. Is there any way to increase this limit? I would appreciate any advice.

liying-sf avatar Mar 05 '24 07:03 liying-sf
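
One common workaround for a fixed input length, not confirmed by the maintainers in this thread, is to leave the limit alone and instead split a long snippet into overlapping token windows, embed each window separately, and average the resulting vectors. The sketch below shows only the windowing and pooling steps in plain Python; the helper names `split_into_windows` and `mean_pool_normalized` and the default stride of 256 are hypothetical choices. It also ignores the special tokens a real tokenizer adds, so in practice `max_len` would need to leave room for them.

```python
import math
from typing import List, Sequence


def split_into_windows(
    token_ids: Sequence[int], max_len: int = 512, stride: int = 256
) -> List[List[int]]:
    """Split token ids into overlapping windows of at most max_len tokens."""
    if max_len <= 0 or stride <= 0 or stride > max_len:
        raise ValueError("require 0 < stride <= max_len")
    if not token_ids:
        return []
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += stride
    return windows


def mean_pool_normalized(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-window embeddings and rescale the result to unit length."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same length")
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    if norm == 0.0:
        raise ValueError("mean vector is zero and cannot be normalized")
    return [x / norm for x in mean]
```

Each window would be passed through the embedding model on its own, and the pooled vector then compared with cosine similarity as before. Averaging loses information about long-range structure, so results on very long files should be checked against your own data.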