Effect of FIM on StarCoder pre-training
Hi!
I'm curious to know some more details about FIM and its effect on the pre-trained model. Here's a paragraph from the SantaCoder paper:
> **FIM for cheap** We observe a minor drop in performance of the FIM model compared to the No-FIM model. Specifically, we see that the pass@100 performance of the FIM model is 2-4% lower on HumanEval and 1% lower on MBPP. While Bavarian et al. (2022) presented evidence for the existence of a FIM-for-free property (i.e., arguing that autoregressive models can be trained with FIM without harming left-to-right capabilities), we do find a small but consistent drop of FIM models on left-to-right text2code benchmarks.
- Was a similar analysis carried out on StarCoder?
- Was StarCoder pre-trained on a 50-50 split between FIM and next-token data? (as indicated in this Megatron script)
Hello, we didn't perform this ablation for StarCoder given the amount of compute it requires for training, but you can check the CodeLlama paper, where the authors observed similar behavior at different scales.
Regarding the FIM percentage, we used 50%.
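
For anyone wondering what a 50% FIM rate means in practice, here is a minimal sketch of the document-level transformation (PSM ordering). It's illustrative only: the names `maybe_apply_fim` and `fim_rate` are made up for this example, and the actual Megatron-LM preprocessing operates on token-id arrays and also supports SPM ordering, padding, and character-vs-token splitting details not shown here.

```python
import random

# StarCoder-style FIM sentinel tokens (PSM ordering).
FIM_PREFIX = "<fim_prefix>"
FIM_MIDDLE = "<fim_middle>"
FIM_SUFFIX = "<fim_suffix>"


def maybe_apply_fim(doc: str, fim_rate: float = 0.5,
                    rng: random.Random = random.Random(0)) -> str:
    """With probability `fim_rate`, rearrange a document into
    prefix/suffix/middle order so the model learns to infill;
    otherwise leave it as a plain left-to-right sample."""
    if rng.random() >= fim_rate:
        return doc  # ordinary next-token (autoregressive) sample
    # Pick two random split points to carve the document into three spans.
    lo, hi = sorted(rng.randrange(len(doc) + 1) for _ in range(2))
    prefix, middle, suffix = doc[:lo], doc[lo:hi], doc[hi:]
    # PSM ordering: the middle span is moved to the end after sentinel
    # tokens, so the model still trains causally but learns to fill in
    # the middle given the surrounding context.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


if __name__ == "__main__":
    print(maybe_apply_fim("def add(a, b):\n    return a + b\n"))
```

So a 50% FIM rate simply means roughly half of the training documents go through this rearrangement while the other half stay as standard left-to-right data.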
I have a question: since it's known that many eval scores drop because of FIM during the pre-training stage, why did you still use FIM at a 50% rate?