
Fine-tuning StarCoder or OctoCoder for IDE Integration: Instruction Tuning vs Base Model Training Approach

Open JunHyungKang opened this issue 2 years ago • 1 comments

When fine-tuning StarCoder or OctoCoder on a custom dataset for integration with an IDE, which approach is more appropriate: processing the data into a question & answer format and masking the custom code for instruction tuning, or training it like a base model, using concat tokens to join the entire code and keeping the labels identical to the inputs for each fixed-length sequence? Could you share any opinions or experience with this?

JunHyungKang, Oct 04 '23 06:10

For code completion in the IDE (GitHub Copilot style), we recommend simply concatenating the code files as we did for pre-training. For chat-like applications and instruction tuning, it's more common to use the instruction/answer format.

loubnabnl, Nov 15 '23 15:11
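
To make the difference between the two data formats concrete, here is a minimal sketch in plain Python. It is not the StarCoder training code: `toy_tokenize`, `EOS_ID`, and both helper functions are hypothetical stand-ins, and a real setup would use the model's own tokenizer and separator token. The first function packs code files into fixed-size blocks with labels identical to the inputs (base-model style); the second builds a prompt/answer example where the prompt tokens are excluded from the loss, which is one common choice for instruction tuning rather than the only one.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss ignores by default
EOS_ID = 0           # hypothetical separator id, standing in for an end-of-text token


def toy_tokenize(text: str) -> List[int]:
    """Hypothetical tokenizer: one id per character, shifted so 0 stays reserved for EOS."""
    return [ord(ch) + 1 for ch in text]


def pack_code_files(files: List[str], block_size: int) -> List[Tuple[List[int], List[int]]]:
    """Base-model style: join files with a separator token, cut the stream into
    fixed-size blocks, and use labels identical to the inputs."""
    stream: List[int] = []
    for source in files:
        stream.extend(toy_tokenize(source))
        stream.append(EOS_ID)
    examples = []
    # The trailing partial block is dropped to keep the sketch simple.
    for start in range(0, len(stream) - block_size + 1, block_size):
        input_ids = stream[start:start + block_size]
        examples.append((input_ids, list(input_ids)))
    return examples


def build_instruction_example(prompt: str, answer: str) -> Tuple[List[int], List[int]]:
    """Instruction style: only the answer tokens contribute to the loss."""
    prompt_ids = toy_tokenize(prompt)
    answer_ids = toy_tokenize(answer) + [EOS_ID]
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels


if __name__ == "__main__":
    packed = pack_code_files(["def f():\n    return 1\n", "x = f()\n"], block_size=8)
    print(len(packed), "packed blocks; labels equal inputs:", packed[0][0] == packed[0][1])
    ids, labels = build_instruction_example("Write f.\n", "def f(): return 1\n")
    print("masked prompt positions:", labels.count(IGNORE_INDEX))
```

For IDE-style completion, the packed blocks correspond to the recommendation above; the masked format is the one that would apply to a chat or instruction-following use case.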