CodeT5
Pretraining Dataset
Hi, could you please clarify the dataset you used for pretraining? In one of the earlier answers, I saw a mention of using "all non-valid/test examples from CodeSearchNet for pretraining". Does this mean examples from the training portion of the dataset?
For example, for Python, I counted only 412,178 such examples in CodeSearchNet, but the paper says you used 453,772 Python examples from CodeSearchNet for pretraining. I am not sure where this inconsistency comes from.
Also, you refer to CodeBERT when discussing the pretraining data, but your dataset statistics also differ slightly from the numbers in the CodeBERT paper. Were there any additional filtering steps that you applied for CodeT5?
Thanks in advance!
Hi, as I remember, the training portion is not the same as the non-valid/test portion; it is actually smaller. You can try to verify this. As for data filtering, we may have filtered out some code snippets that were too short or too long.
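For reference, a minimal sketch of what such a length-based filter might look like. The thresholds below (`MIN_TOKENS`, `MAX_TOKENS`) and the whitespace tokenization are purely illustrative assumptions, not the actual values or tokenizer used for CodeT5:

```python
# Illustrative length filter: drop code snippets that are too short or too long.
# NOTE: thresholds and whitespace tokenization are assumptions for this sketch,
# not the documented CodeT5 preprocessing.
MIN_TOKENS = 10
MAX_TOKENS = 512

def keep_example(code: str) -> bool:
    """Keep a snippet only if its whitespace-token count falls in range."""
    n_tokens = len(code.split())
    return MIN_TOKENS <= n_tokens <= MAX_TOKENS

examples = [
    "def f(): pass",  # too short -> dropped
    "def add(a, b):\n    return a + b  " + "# padding comment " * 20,  # kept
]
filtered = [code for code in examples if keep_example(code)]
```

A filter like this would explain small discrepancies between the raw CodeSearchNet counts and the per-language totals reported in the paper.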