metaseq
Data processing details in pretraining
Hello, I have a question about data preprocessing during the pretraining phase. I know that OPT, like GPT, is trained on text grouped into long chunks (length = 2048). My question is whether there is a special token between the grouped texts. For example, suppose we have:
sent1\n sent2\n sent3\n sent4\n
in the training dataset. During pretraining, these will be grouped together into "sent1 sent2 sent3 sent4". My question is: are they concatenated with a space? With a special token? Or with a line-break token?
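To make the question concrete, here is a minimal sketch of GPT-style document packing. This is purely illustrative, not metaseq's actual implementation: the separator token id, the helper name `pack_documents`, and the tiny block size are all assumptions; whether OPT inserts such a separator at all is exactly what is being asked.

```python
SEP_TOKEN = 2    # hypothetical id for a document-separator token
BLOCK_SIZE = 8   # 2048 in OPT pretraining; kept small here for illustration

def pack_documents(tokenized_docs, sep_token=SEP_TOKEN, block_size=BLOCK_SIZE):
    """Join tokenized documents with a separator, then split the stream
    into fixed-length blocks, dropping the incomplete remainder."""
    stream = []
    for doc in tokenized_docs:
        stream.extend(doc)
        stream.append(sep_token)  # assumed separator between documents
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
blocks = pack_documents(docs)
# One full block of 8 tokens; the trailing 4 tokens are dropped.
```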
These pretraining details would help us use the model properly.
Thanks
I have the same question. I would also like to confirm whether the same preprocessing is applied to the conversational data (PushShift.io Reddit)?
Thanks.