
Ablation on data size

Open yssjtu opened this issue 2 years ago • 2 comments

Hi, I appreciate the amazing work on unsupervised code translation! I wonder if you have done an ablation study on the training data size of TransCoder. The unsupervised model needs far more training data (over 500M functions for 3 languages) than existing code PLMs like CodeT5 (8.35M for 7 languages). How does TransCoder perform if less data is provided?

yssjtu avatar Mar 15 '22 03:03 yssjtu

Hi, thank you. We have not really done an ablation study on the dataset size. However, the numbers you are quoting are for non-deduplicated functions; we get about the same results training on around 15M deduplicated functions. I also remember that we lost only a few points of computational accuracy when using only a fraction (1/8th) of the data.
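For readers wondering what function-level deduplication looks like in practice: the TransCoder repo has its own preprocessing pipeline, but a minimal exact-match sketch (whitespace-normalized hashing; the helper name and corpus are made up for illustration) might be:

```python
import hashlib

def dedup_functions(functions):
    """Keep the first copy of each unique function body (exact-match dedup)."""
    seen = set()
    unique = []
    for fn in functions:
        # Normalize whitespace so trivially reformatted copies collapse together
        key = hashlib.sha256(" ".join(fn.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(fn)
    return unique

corpus = [
    "int add(int a, int b) { return a + b; }",
    "int add(int a,  int b) { return a + b; }",   # whitespace-only duplicate
    "int sub(int a, int b) { return a - b; }",
]
print(len(dedup_functions(corpus)))  # prints 2
```

Real pipelines often go further (e.g. near-duplicate detection), which is why the deduplicated count can drop so sharply from the raw one.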

baptisteroziere avatar Mar 15 '22 11:03 baptisteroziere

Hi, thanks for the quick reply! I see that TransCoder uses functions for training DAE and BT, but complete source files for XLM (https://github.com/facebookresearch/TransCoder#data-needed). So are the 15M deduped functions used for DAE and BT? What data size was used for XLM?

yssjtu avatar Mar 15 '22 11:03 yssjtu