
90% accuracy claim.

Open varadhbhatnagar opened this issue 4 years ago • 6 comments

It is mentioned on the website (https://codist-ai.com/) that this model gives 90% accuracy. Can you elaborate what exactly is this accuracy and how is it measured?

varadhbhatnagar avatar Sep 16 '20 11:09 varadhbhatnagar

Hello @varadhbhatnagar, we pre-trained on a corpus of over 6M lines of logical Python code, injecting special tokens such as (indent) and (dedent) to preserve the logical structure of the code. We then fine-tuned the model on a binary classification task: the model is shown a pair of token sequences, the first drawn from code and the second from comments, and it must predict whether they match. We fine-tuned on about 35K such pairs, and on that task the training F1 score reaches 90%.
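As an illustration of the structure-preserving tokens mentioned above, here is a minimal sketch (my own, not the project's actual preprocessing code) of how Python's implicit indentation could be rewritten as explicit (indent)/(dedent) markers using the standard tokenize module:

```python
import io
import tokenize

def logical_tokens(source):
    """Replace Python's implicit block structure with explicit
    (indent)/(dedent) marker tokens, as described in the comment above.
    Hypothetical sketch; the real preprocessing pipeline is not public."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            out.append("(indent)")
        elif tok.type == tokenize.DEDENT:
            out.append("(dedent)")
        elif tok.string.strip():  # drop NEWLINE/NL/ENDMARKER whitespace tokens
            out.append(tok.string)
    return out

print(logical_tokens("def f(x):\n    return x + 1\n"))
# ['def', 'f', '(', 'x', ')', ':', '(indent)', 'return', 'x', '+', '1', '(dedent)']
```

The point of the markers is that once whitespace is stripped for subword tokenization, the block structure would otherwise be lost; the explicit tokens carry it into the model's vocabulary.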

Hope this answers your question.

rcshubhadeep avatar Sep 16 '20 14:09 rcshubhadeep

Thanks. Is there a paper associated with this project? And is this related to Microsoft's CodeBERT in any way?

varadhbhatnagar avatar Sep 16 '20 15:09 varadhbhatnagar

It is not related to MS CodeBERT (apart from sharing the same name). The methodology we followed is inspired by the CuBERT paper, with our own methods and ideas blended into it. We have not published a paper yet, but the model is open-sourced for everyone to use.

rcshubhadeep avatar Sep 16 '20 15:09 rcshubhadeep

Thanks for asking the questions 👍

rcshubhadeep avatar Sep 16 '20 15:09 rcshubhadeep

I wanted to get an idea of the method complexity this model can handle. For training and testing, did you use simple methods similar to the files in the test_files directory?

varadhbhatnagar avatar Sep 19 '20 12:09 varadhbhatnagar

Hi,

We fine-tuned this model on the task using the py150k dataset.

Just to clarify, this dataset has 150K Python files. We used our open-source library tree-hugger to mine those files and create a dataset of (method, docstring) tuples. We then swapped about 50% of the docstrings and marked those pairs as the negative class, while the rest were marked positive. Finally, we fine-tuned the pretrained model on this task.
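The docstring-swapping step described above can be sketched as follows. This is my own illustrative reconstruction, not the project's code; `make_pairs`, `swap_fraction`, and the rotation trick are all assumptions about one reasonable way to build such negatives:

```python
import random

def make_pairs(examples, swap_fraction=0.5, seed=0):
    """Hypothetical sketch of the dataset construction described above:
    given (method, docstring) tuples, swap docstrings between roughly
    swap_fraction of them and label those pairs negative (0); the
    untouched pairs keep their own docstring and are positive (1)."""
    rng = random.Random(seed)
    pairs = [[method, doc, 1] for method, doc in examples]
    idx = rng.sample(range(len(pairs)), k=int(len(pairs) * swap_fraction))
    if len(idx) > 1:
        docs = [pairs[i][1] for i in idx]
        rotated = docs[1:] + docs[:1]  # rotate so no picked method keeps its own docstring
        for i, doc in zip(idx, rotated):
            pairs[i][1] = doc
            pairs[i][2] = 0  # negative: mismatched (method, docstring) pair
    return [tuple(p) for p in pairs]

examples = [("def a(): ...", "doc a"), ("def b(): ...", "doc b"),
            ("def c(): ...", "doc c"), ("def d(): ...", "doc d")]
for method, doc, label in make_pairs(examples):
    print(label, method, "<->", doc)
```

Rotating the selected docstrings (rather than shuffling them independently) guarantees that every negative pair really is mismatched, as long as the docstrings are distinct.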

rcshubhadeep avatar Sep 20 '20 12:09 rcshubhadeep