Fine-tuning Starcoder or Octocoder for IDE Integration: Instruction Tuning vs Base Model Training Approach
When fine-tuning StarCoder or OctoCoder on a custom dataset for integration with an IDE, would it be more appropriate to format the data as question/answer pairs (masking the custom code) for instruction tuning, or to train it like a base model, concatenating entire code files with separator tokens and keeping the labels identical to the inputs? Could you share any opinions or experiences on this?
For code completion in the IDE (GitHub Copilot style), we recommend simply concatenating the code files as we did for pre-training. For chat-like applications and instruction tuning, it's more common to use the instruction/answer format.
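To make the two data-preparation styles concrete, here is a minimal sketch. It assumes StarCoder's `<|endoftext|>` token is used as the separator when packing files, and the question/answer template is an illustrative assumption, not an official format:

```python
SEP = "<|endoftext|>"  # StarCoder's EOS token, used here as a file separator

def pack_code_files(files):
    """Base-model style: concatenate whole code files with a separator
    token; labels equal inputs (plain next-token prediction)."""
    return SEP.join(files) + SEP

def format_instruction_pair(question, answer):
    """Instruction-tuning style: wrap each example in a Q/A template
    (hypothetical template, chosen for illustration)."""
    return f"Question: {question}\nAnswer: {answer}{SEP}"

# Example usage for each style:
packed = pack_code_files([
    "def add(a, b):\n    return a + b\n",
    "def sub(a, b):\n    return a - b\n",
])
pair = format_instruction_pair(
    "Write a function that adds two numbers.",
    "def add(a, b):\n    return a + b",
)
```

In the packed case the whole sequence is trained on; in the instruction case it is common to mask the loss on the question portion so the model only learns to produce the answer.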