
Is there any code for preparing the dataset on which Starcoder was originally trained?

Open xpl opened this issue 1 year ago • 6 comments

I need to know how to use <filename>, <fim_*> and other special tokens listed in tokenizer special_tokens_map when preparing the dataset.

I've been successfully able to finetune Starcoder on my own code, but I haven't specially prepared the dataset for FIM, so I feel the result could be inferior, as the VSCode extension uses FIM.
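
For anyone wondering what such a sample might look like, here is a minimal sketch built from those special tokens; the exact layout (the `<filename>` header and the PSM token order) is an assumption on my part, not a confirmed spec:

```python
# The tokens below are taken from StarCoder's special_tokens_map; how they are
# combined here (a <filename> header followed by PSM-ordered FIM pieces) is my
# assumption about the format, not a confirmed spec.
prefix = "def add(a, b):\n    return "
middle = "a + b"
suffix = "\n\nprint(add(1, 2))\n"

sample = (
    "<filename>utils/math.py\n"   # metadata token followed by the file path
    f"<fim_prefix>{prefix}"       # code before the hole
    f"<fim_suffix>{suffix}"       # code after the hole
    f"<fim_middle>{middle}"       # the hole itself, which the model learns to predict
)
print(sample)
```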

xpl commented May 30 '23 18:05

What's interesting is that after finetuning it still seems to work with FIM, so the finetuning at least didn't make the model completely "forget" FIM :)

xpl commented May 30 '23 18:05

I've been successfully able to finetune Starcoder on my own code

May I know what hardware you used for finetuning?

seyyedaliayati commented May 31 '23 15:05

May I know what hardware you used for finetuning?

8× A100

I adapted the training scripts from the chat folder; this one doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiency in the restoring code). But otherwise it runs pretty fast on a 330 MB dataset.
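
In case it's useful to others hitting the same OOM, the adjustment amounts to lowering the per-device batch size and compensating with gradient accumulation. A minimal sketch assuming a Hugging Face Trainer-style script, with illustrative values rather than the exact settings used:

```python
from transformers import TrainingArguments

# Hypothetical values: shrink the per-GPU batch and keep a reasonable effective
# batch size via gradient accumulation to avoid OOM on 8x A100.
args = TrainingArguments(
    output_dir="starcoder-finetune",
    per_device_train_batch_size=1,    # reduced to fit in memory
    gradient_accumulation_steps=16,   # effective batch = 8 GPUs * 1 * 16 = 128
    gradient_checkpointing=True,      # trades compute for activation memory
    bf16=True,                        # A100s support bfloat16
    num_train_epochs=1,
)
```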

P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems to be what was used to train Starcoder.
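
The gist of that preparation step is small: with some probability each document is cut at two random points and reordered into prefix/suffix/middle (or the SPM variant), with the FIM tokens spliced in. A simplified sketch in the spirit of that code; the exact token layout and rates in Megatron-LM may differ:

```python
import random
from typing import Optional

def apply_fim(sample: str, fim_rate: float = 0.5, spm_rate: float = 0.5,
              rng: Optional[random.Random] = None) -> str:
    """With probability fim_rate, rearrange a document into fill-in-the-middle order."""
    rng = rng or random.Random()
    if len(sample) < 2 or rng.random() > fim_rate:
        return sample  # most documents stay in plain left-to-right order
    lo, hi = sorted(rng.sample(range(len(sample)), 2))
    prefix, middle, suffix = sample[:lo], sample[lo:hi], sample[hi:]
    if rng.random() < spm_rate:
        # SPM-style ordering: suffix shown first, then prefix, then the middle to predict
        return f"<fim_prefix><fim_suffix>{suffix}<fim_middle>{prefix}{middle}"
    # PSM-style ordering: prefix, suffix, then the middle to predict
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"
```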

xpl commented May 31 '23 16:05

Yes, you can use the FIM preparation code in Megatron; there's also a FIM implementation here that could be easier to integrate with the current codebase. As for the data preparation, we have the code at bigcode-dataset, including how we added the special code tokens.
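
To double-check which special tokens the tokenizer actually ships with before wiring up the preparation code, a quick sketch using transformers (assuming access to the bigcode/starcoder checkpoint on the Hub; the exact fields depend on the checkpoint):

```python
from transformers import AutoTokenizer

# Requires access to the gated bigcode/starcoder repo on the Hugging Face Hub.
tok = AutoTokenizer.from_pretrained("bigcode/starcoder")

print(tok.special_tokens_map)             # eos/pad/unk plus additional_special_tokens
for t in tok.additional_special_tokens:   # <fim_prefix>, <fim_suffix>, <fim_middle>, <filename>, ...
    print(t, tok.convert_tokens_to_ids(t))
```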

loubnabnl commented Jun 01 '23 15:06

May I know what hardware you used for finetuning?

8× A100

I adapted the training scripts from the chat folder; this one doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiency in the restoring code). But otherwise it runs pretty fast on a 330 MB dataset.

P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems to be what was used to train Starcoder.

May I ask if there are any relevant scripts and tutorials for reference?

1920853199 commented Jun 11 '23 10:06

May I know what hardware you used for finetuning?

8× A100

I adapted the training scripts from the chat folder; this one doesn't use LoRA and eats a lot of memory. I had to decrease the batch size, otherwise it OOMed. It also OOMs when trying to resume from a checkpoint (there must be some memory inefficiency in the restoring code). But otherwise it runs pretty fast on a 330 MB dataset.

P.S. Regarding my original question: I've found the FIM preparation code in the Megatron-LM codebase. It seems to be what was used to train Starcoder.

So how much time did you spend?

1920853199 commented Jun 11 '23 11:06