starcoder
Code translations
Is it possible to do java-cs/cs-java code translations as mentioned in https://github.com/salesforce/CodeT5 ?
AFAIK, some models (like CodeGeex) use special tokens to indicate the language explicitly, which makes translation easy. I don't know whether StarCoder also has such tokens.
Regarding the special tokens, we did condition on repository metadata during training:
We prepended the repository name, file name, and the number of stars to the context of the code file. To not overfit on the exact number of stars, we categorized GitHub stars into five buckets: 0, 1–10, 10–100, 100–1000, 1000+. To enable the model to operate without this metadata during inference, we prefixed the repository name, filename, and stars independently at random, each with a probability of 0.2.
<reponame>REPONAME<filename>FILENAME<gh_stars>STARS\nCode<eos>
So you can, for example, start a prompt with:
<filename>python_code/file.py\n
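A minimal sketch of how such a metadata prefix could be assembled at inference time, following the template quoted above. The helper names `bucket_stars` and `build_prompt` are hypothetical, the bucket label strings (written with a plain hyphen) and the handling of the boundary values 10, 100, and 1000 are assumptions, and returning the bare code when no metadata is given is also an assumption rather than something confirmed in this thread.

```python
from typing import Optional


def bucket_stars(stars: int) -> str:
    """Map a raw star count to one of the five buckets from the quoted text.

    Assumption: boundary values (10, 100, 1000) fall into the lower bucket.
    """
    if stars <= 0:
        return "0"
    if stars <= 10:
        return "1-10"
    if stars <= 100:
        return "10-100"
    if stars <= 1000:
        return "100-1000"
    return "1000+"


def build_prompt(
    code: str,
    reponame: Optional[str] = None,
    filename: Optional[str] = None,
    stars: Optional[int] = None,
) -> str:
    """Prepend whichever metadata fields are provided, in template order."""
    parts = []
    if reponame is not None:
        parts.append(f"<reponame>{reponame}")
    if filename is not None:
        parts.append(f"<filename>{filename}")
    if stars is not None:
        parts.append(f"<gh_stars>{bucket_stars(stars)}")
    prefix = "".join(parts)
    # A newline separates the metadata from the code, as in the template.
    return f"{prefix}\n{code}" if prefix else code


if __name__ == "__main__":
    print(build_prompt("def add(a, b):", filename="python_code/file.py"))
```

Passing only `filename` reproduces the one-field prompt shown above; the other fields can be added or omitted independently.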
As for translation, the model wasn't specifically trained to translate code but you can craft a prompt to do so (see this example).
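One possible way to craft such a prompt is sketched here. This is not the linked example; it is a hypothetical completion-style layout that places the source code in comments and leaves the target version open for the model to continue. The function name `build_translation_prompt` and the exact wording of the prompt are assumptions, it relies on both languages using `//` line comments (true for Java and C#), and the quality of the resulting translation is not guaranteed.

```python
def build_translation_prompt(source_code: str, source_lang: str, target_lang: str) -> str:
    """Build a completion-style prompt asking for a translation.

    The source code is shown as line comments so that the model's
    continuation is the target-language version only.
    """
    commented = "\n".join(f"// {line}" for line in source_code.splitlines())
    return (
        f"// {source_lang} version:\n"
        f"{commented}\n"
        f"// Equivalent {target_lang} version:\n"
    )


if __name__ == "__main__":
    java_snippet = "int add(int a, int b) {\n    return a + b;\n}"
    print(build_translation_prompt(java_snippet, "Java", "C#"))
```

The returned string would then be passed to whatever generation interface is used to run the model, with generation stopped once the translated function is complete.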
Thanks for responding! I have read through the paper, but I found this part a bit confusing: "we prefixed the repository name, filename, and stars independently at random, each with a probability of 0.2. <reponame>REPONAME<filename>FILENAME<gh_stars>STARS\nCode<eos>"
Does it mean that each of the three fields is independently included with a 20% chance and omitted otherwise, thus producing samples such as:
<reponame>REPONAME<filename>FILENAME<gh_stars>STARS\nCode<eos> (0.2*0.2*0.2 probability)
<reponame>REPONAME<gh_stars>STARS\nCode<eos> (0.2*0.8*0.2 probability)
<gh_stars>STARS\nCode<eos> (0.8*0.8*0.2 probability)
\n Code<eos> (0.8*0.8*0.8 probability)
I'm not sure I understand that correctly. Is the corresponding code available?
That's correct, you can find the code here: https://github.com/bigcode-project/bigcode-dataset/blob/0b3c1ba500d132b14654a3d07c67f069e7f07410/preprocessing/add_content_with_meta.py#L41
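The arithmetic behind that reading can be checked with a short sketch. It is not the linked preprocessing script; it only enumerates the eight include/omit combinations under the assumption, confirmed above, that each field is included independently with probability 0.2. The names `FIELDS` and `combination_probabilities` are hypothetical.

```python
from itertools import product

P_INCLUDE = 0.2
FIELDS = ("reponame", "filename", "gh_stars")


def combination_probabilities(p: float = P_INCLUDE) -> dict:
    """Return the probability of every subset of metadata fields.

    Keys are tuples of the included field names, in template order.
    """
    probs = {}
    for mask in product((True, False), repeat=len(FIELDS)):
        prob = 1.0
        for included in mask:
            prob *= p if included else (1.0 - p)
        kept = tuple(name for name, included in zip(FIELDS, mask) if included)
        probs[kept] = prob
    return probs


if __name__ == "__main__":
    for kept, prob in combination_probabilities().items():
        print(kept or "(no metadata)", round(prob, 3))
```

Under this assumption, roughly half of the samples (0.8 * 0.8 * 0.8 = 0.512) carry no metadata at all, and only 0.2 * 0.2 * 0.2 = 0.008 carry all three fields.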
Hello, I'm also looking for a model that can do code translation. Have you found one with reasonably high accuracy? I tried CodeGeex and its output was badly wrong.
@meihao5631 A seq2seq NMT model may be better at this specific task, such as TransCoder (https://github.com/facebookresearch/TransCoder) from Meta AI. I once used it to translate between Java and Python for online judge (OJ) problems, and it turned out to be good enough.
@loubnabnl
> As for translation, the model wasn't specifically trained to translate code but you can craft a prompt to do so (see this example).
Could you write a demo? Thanks.