
Data processing details in pretraining

Open: getao opened this issue 2 years ago • 1 comment

Hello, I have a question about data preprocessing during the pretraining phase. I know that OPT, like GPT, is trained on text grouped into long chunks (length = 2048 tokens). My question is whether a special token is placed between the grouped texts. For example, suppose we have:

sent1\n sent2\n sent3\n sent4\n

in the training dataset. During pretraining, they will be grouped together into "sent1 sent2 sent3 sent4". Are they concatenated with a space? Or with a special token like ? Or with a line-break token?
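A minimal sketch of the grouping step in question, to make the three options concrete. It is not the metaseq implementation: `pack_documents`, the one-token-per-sentence toy data, the block size of 4, and the `<sep>` placeholder are all hypothetical, and which separator OPT actually used is exactly what this issue asks.

```python
from typing import Iterable, List


def pack_documents(
    docs: Iterable[List[str]],
    block_size: int,
    separator: List[str],
) -> List[List[str]]:
    """Concatenate tokenized documents into one stream, appending
    `separator` after each document, then cut the stream into
    fixed-length blocks. A trailing partial block is dropped."""
    stream: List[str] = []
    for doc in docs:
        stream.extend(doc)
        stream.extend(separator)
    last_start = len(stream) - block_size
    return [stream[i:i + block_size] for i in range(0, last_start + 1, block_size)]


# Toy data: each "sentence" is a single token (an assumption for brevity).
docs = [["sent1"], ["sent2"], ["sent3"], ["sent4"]]

no_separator = pack_documents(docs, block_size=4, separator=[])
line_break = pack_documents(docs, block_size=4, separator=["\n"])
special_token = pack_documents(docs, block_size=4, separator=["<sep>"])  # hypothetical token
```

The choice of separator changes what the model sees at document boundaries, which is why it matters when building prompts at inference time.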

Knowing these pretraining details would help us use the model properly.

Thanks

getao, Oct 31 '22 07:10

I have the same question. I would also like to confirm whether the same preprocessing is applied to the conversational data (PushShift.io Reddit).

Thanks.

ouyangliqi, Jan 03 '23 07:01