
Is there any code for preparing the dataset on which Starcoder was originally trained?

Open xpl opened this issue 1 year ago • 6 comments

I need to know how to use <filename>, <fim_*> and other special tokens listed in tokenizer special_tokens_map when preparing the dataset.

I've been successfully able to finetune Starcoder on my own code, but I haven't specially prepared the dataset for FIM, so I feel the result could be inferior, as the VSCode extension uses FIM.
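
For anyone wondering what such a sample might look like, here is a minimal sketch built from those special tokens; the exact layout (the `<filename>` header and the PSM token order) is an assumption on my part, not a confirmed spec:

```python
# The tokens below are taken from StarCoder's special_tokens_map; how they are
# combined here (a <filename> header followed by PSM-ordered FIM pieces) is my
# assumption about the format, not a confirmed spec.
prefix = "def add(a, b):\n    return "
middle = "a + b"
suffix = "\n\nprint(add(1, 2))\n"

sample = (
    "<filename>utils/math.py\n"   # metadata token followed by the file path
    f"<fim_prefix>{prefix}"       # code before the hole
    f"<fim_suffix>{suffix}"       # code after the hole
    f"<fim_middle>{middle}"       # the hole itself, which the model learns to predict
)
print(sample)
```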

xpl commented May 30 '23 18:05

What's interesting is that after finetuning it still seems to work with FIM, so the finetuning at least didn't make the model completely "forget" FIM :)

xpl commented May 30 '23 18:05

I've been successfully able to finetune Starcoder on my own code

May I know what hardware you used for finetuning?

seyyedaliayati commented May 31 '23 15:05

May I know what hardware you used for finetuning?

8× A100

I adapted the training scripts from the chat folder; this one doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiency in the restoring code). But otherwise it runs pretty fast on a 330 MB dataset.
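
In case it's useful to others hitting the same OOM, the adjustment amounts to lowering the per-device batch size and compensating with gradient accumulation. A minimal sketch assuming a Hugging Face Trainer-style script, with illustrative values rather than the exact settings used:

```python
from transformers import TrainingArguments

# Hypothetical values: shrink the per-GPU batch and keep a reasonable effective
# batch size via gradient accumulation to avoid OOM on 8x A100.
args = TrainingArguments(
    output_dir="starcoder-finetune",
    per_device_train_batch_size=1,    # reduced to fit in memory
    gradient_accumulation_steps=16,   # effective batch = 8 GPUs * 1 * 16 = 128
    gradient_checkpointing=True,      # trades compute for activation memory
    bf16=True,                        # A100s support bfloat16
    num_train_epochs=1,
)
```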

P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems to be what was used to train Starcoder.
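
The gist of that preparation step is small: with some probability each document is cut at two random points and reordered into prefix/suffix/middle (or the SPM variant), with the FIM tokens spliced in. A simplified sketch in the spirit of that code; the exact token layout and rates in Megatron-LM may differ:

```python
import random
from typing import Optional

def apply_fim(sample: str, fim_rate: float = 0.5, spm_rate: float = 0.5,
              rng: Optional[random.Random] = None) -> str:
    """With probability fim_rate, rearrange a document into fill-in-the-middle order."""
    rng = rng or random.Random()
    if len(sample) < 2 or rng.random() > fim_rate:
        return sample  # most documents stay in plain left-to-right order
    lo, hi = sorted(rng.sample(range(len(sample)), 2))
    prefix, middle, suffix = sample[:lo], sample[lo:hi], sample[hi:]
    if rng.random() < spm_rate:
        # SPM-style ordering: suffix shown first, then prefix, then the middle to predict
        return f"<fim_prefix><fim_suffix>{suffix}<fim_middle>{prefix}{middle}"
    # PSM-style ordering: prefix, suffix, then the middle to predict
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```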

xpl commented May 31 '23 16:05

Yes, you can use the FIM preparation code in Megatron; there's also a FIM implementation here that could be easier to integrate with the current codebase. As for the data preparation, we have the code at bigcode-dataset, including how we added the special code tokens.
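
To double-check which special tokens the tokenizer actually ships with before wiring up the preparation code, a quick sketch using transformers (assuming access to the bigcode/starcoder checkpoint on the Hub; the exact fields depend on the checkpoint):

```python
from transformers import AutoTokenizer

# Requires access to the gated bigcode/starcoder repo on the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("bigcode/starcoder")

print(tok.special_tokens_map)             # eos/pad/unk plus additional_special_tokens
for t in tok.additional_special_tokens:   # <fim_prefix>, <fim_suffix>, <fim_middle>, <filename>, ...
    print(t, tok.convert_tokens_to_ids(t))
```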

loubnabnl commented Jun 01 '23 15:06

May I know what hardware you used for finetuning?

8× A100

I adapted the training scripts from the chat folder; this one doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiency in the restoring code). But otherwise it runs pretty fast on a 330 MB dataset.

P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems to be what was used to train Starcoder.

May I ask if there are any relevant scripts and tutorials for reference?

1920853199 commented Jun 11 '23 10:06

May I know what hardware you used for finetuning?

8× A100

I adapted the training scripts from the chat folder; this one doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiency in the restoring code). But otherwise it runs pretty fast on a 330 MB dataset.

P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems to be what was used to train Starcoder.

So how much time did you spend?

1920853199 commented Jun 11 '23 11:06