code-bert
90% accuracy claim.
It is mentioned on the website (https://codist-ai.com/) that this model gives 90% accuracy. Can you elaborate on what exactly this accuracy is and how it is measured?
Hello @varadhbhatnagar, we pre-trained on a corpus of over 6M lines of logical Python code (injecting some special tokens such as (indent), (dedent), etc. to preserve the logical structure of the code). We then fine-tuned the model on a binary classification problem: the model is shown a pair of token sequences, the first taken from the code and the second from the comments, and the task is to predict whether they match. We fine-tuned for this task using about 35K pairs. On that task, the training F1 score reaches 90%.
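To illustrate the structure-token idea, here is a minimal sketch of how (indent)/(dedent) markers can be injected into Python source using the standard library's `tokenize` module. This is only an illustration of the concept; the actual code-bert preprocessing may differ, and the (newline) marker is my own assumption.

```python
import io
import tokenize

def add_structure_tokens(source: str) -> str:
    """Flatten Python source into a token stream where indentation is
    made explicit via (indent)/(dedent) markers, so a model can see the
    logical block structure even after whitespace is discarded.
    Illustrative sketch only -- not the actual code-bert pipeline."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            out.append("(indent)")
        elif tok.type == tokenize.DEDENT:
            out.append("(dedent)")
        elif tok.type in (tokenize.NEWLINE, tokenize.NL):
            out.append("(newline)")  # assumed marker, analogous to the above
        elif tok.string.strip():
            out.append(tok.string)
    return " ".join(out)

print(add_structure_tokens("def f(x):\n    return x + 1\n"))
```

Keeping these markers lets the model recover block nesting from a flat token sequence, which plain whitespace-stripping tokenization would lose.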
Hope this answers your question.
Thanks. Is there a paper associated with this project? And is this related to Microsoft's CodeBERT in any way?
It is not related to MS CodeBERT (apart from sharing a similar name). The methodology we followed is inspired by the CuBERT paper, with our own methods and ideas blended into it. We have not published a paper on it yet, but the model is open-sourced for everyone to use.
Thanks for asking the questions 👍
I wanted to get an idea of the method complexity this model can handle. For training and testing, did you use simple methods similar to the files in the test_files directory?
Hi,
We fine-tuned this model on the task using the py150k dataset.
To clarify, this dataset has 150K Python files. We used our open-source library tree-hugger to mine those files and create a (method, docstring) tuple dataset. We then swapped about 50% of those docstrings and marked those pairs as the negative class, while the rest are positive. The pretrained model was then fine-tuned on this task.
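The docstring-swapping step described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual code: the function name, the seeded sampling, and the exact swap strategy are all assumptions.

```python
import random

def make_pairs(method_doc, swap_frac=0.5, seed=0):
    """Build a labeled matching dataset from (method, docstring) tuples:
    keep ~(1 - swap_frac) of the pairs as positives (label 1) and, for
    the rest, replace the docstring with one drawn from a *different*
    method (label 0). Illustrative sketch; the real pipeline may differ."""
    rng = random.Random(seed)
    pairs = list(method_doc)
    n = len(pairs)
    docs = [d for _, d in pairs]
    # randomly choose which indices become negatives
    idx = list(range(n))
    rng.shuffle(idx)
    neg_idx = set(idx[: int(n * swap_frac)])
    dataset = []
    for i, (method, doc) in enumerate(pairs):
        if i in neg_idx:
            # sample a docstring index j != i so the pair is a mismatch
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            dataset.append((method, docs[j], 0))
        else:
            dataset.append((method, doc, 1))
    return dataset
```

Each resulting (method, docstring, label) triple then feeds the binary classification fine-tuning described earlier in the thread.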