starcoder
Is there any code for preparing the dataset on which StarCoder was originally trained?
I need to know how to use <filename>, <fim_*>, and the other special tokens listed in the tokenizer's special_tokens_map when preparing the dataset.
I've successfully fine-tuned StarCoder on my own code, but I haven't specially prepared the dataset for FIM, so I suspect the result could be inferior, since the VSCode extension uses FIM.
What's interesting is that after fine-tuning it still seems to work with FIM, so fine-tuning at least didn't make the model completely "forget" FIM :)
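To illustrate what I mean, here's a minimal sketch of building a PSM-style FIM training example with the <fim_*> tokens from special_tokens_map. The split-point sampling here is my own assumption, not the official recipe:

```python
import random

# Token strings as they appear in the StarCoder tokenizer's special_tokens_map.
FIM_PREFIX = "<fim_prefix>"
FIM_MIDDLE = "<fim_middle>"
FIM_SUFFIX = "<fim_suffix>"

def make_fim_example(code: str, rng: random.Random) -> str:
    """Build one PSM-ordered FIM training example from a plain code string.

    Two split points are sampled uniformly at random; the span between them
    becomes the "middle" the model learns to infill. This sampling scheme is
    a guess; the exact recipe is in the Megatron-LM FIM code.
    """
    if len(code) < 2:
        return code  # too short to split; keep it as a normal example
    lo, hi = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:lo], code[lo:hi], code[hi:]
    # PSM ordering: prefix, then suffix, then the middle as the completion target.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```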
I've successfully fine-tuned StarCoder on my own code
May I know what hardware you used for fine-tuning?
8× A100
I adapted the train scripts from the chat folder; that one doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiency in the restoring code). But otherwise it runs pretty fast on a 330 MB dataset.
P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems to be what they used to train StarCoder.
Yes, you can use the FIM preparation code in Megatron; there's also a FIM implementation here that could be easier to integrate with the current codebase. As for the data preparation, we have the code at bigcode-dataset, including how we added the special code tokens.
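For illustration, a rough sketch of how the metadata tokens might get prepended per file (the actual logic lives in bigcode-dataset; the token order shown and the 50% attach probability are assumptions, not the exact recipe):

```python
import random

def add_file_metadata(code: str, repo: str, path: str, stars: int,
                      rng: random.Random, meta_prob: float = 0.5) -> str:
    """Prepend repository metadata using StarCoder's special tokens.

    Metadata is attached only part of the time (meta_prob is a guessed
    value) so the model also learns to complete bare code without it.
    """
    if rng.random() < meta_prob:
        return f"<reponame>{repo}<filename>{path}<gh_stars>{stars}\n{code}"
    return code
```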
May I ask if there are any relevant scripts and tutorials for reference?
So how much time did you spend?