CodeT5
Pretraining Dataset
Hi, could you please clarify the dataset you used for pretraining? In one of the earlier answers, I saw a mention of using "all non-valid/test examples from CodeSearchNet for pretraining". Does this mean examples from the training portion of the dataset?
For example, for Python, I counted only 412,178 such examples in CodeSearchNet, but the paper says you used 453,772 Python examples from CodeSearchNet for pretraining. I am not sure where this inconsistency comes from.
Also, you refer to CodeBERT when discussing the pretraining data, but your dataset statistics also differ slightly from the numbers in the CodeBERT paper. Were there any additional filtering steps that you applied for CodeT5?
Thanks in advance!
Hi, as I remember, the training portion is not the same as the non-valid/test portion; it is actually smaller. You can try to verify this. As for data filtering, we may have filtered out some code snippets that were too short or too long.
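For reference, a minimal sketch of what such a length-based filter might look like. The thresholds below (`MIN_TOKENS`, `MAX_TOKENS`) and the whitespace tokenization are purely illustrative assumptions, not the actual values or tokenizer used for CodeT5:

```python
# Illustrative length filter: drop code snippets that are too short or too long.
# NOTE: thresholds and whitespace tokenization are assumptions for this sketch,
# not the documented CodeT5 preprocessing.
MIN_TOKENS = 10
MAX_TOKENS = 512

def keep_example(code: str) -> bool:
    """Keep a snippet only if its whitespace-token count falls in range."""
    n_tokens = len(code.split())
    return MIN_TOKENS <= n_tokens <= MAX_TOKENS

examples = [
    "def f(): pass",  # too short -> dropped
    "def add(a, b):\n    return a + b  " + "# padding comment " * 20,  # kept
]
filtered = [code for code in examples if keep_example(code)]
```

A filter like this would explain small discrepancies between the raw CodeSearchNet counts and the per-language totals reported in the paper.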