
Feeding the model separate examples instead of one continuous block of text

Open CupOfGeo opened this issue 2 years ago • 1 comment

Hello, I'm interested in adding this feature: adding a function in text2csv.py to take a folder of text files, and then, in run_clm.py, padding and truncating them instead of using the group_texts function.

CupOfGeo avatar Oct 26 '21 20:10 CupOfGeo

I'm using songs as my data, where the newline spacing is important, and I'd like the songs to stay separate while fine-tuning so the end of one song isn't treated as the start of another. I have text2csv.py create the CSVs so that each row is a song, but when group_texts is applied it concatenates them all and makes blocks of 1024 tokens. I'm looking into adding the DataCollatorWithPadding, but not having much luck at the moment.
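In case it helps clarify the idea: here is a minimal, hedged sketch of the per-example behavior I'm after, written in plain Python as a stand-in for what a tokenizer's `truncation=True` / `padding="max_length"` would do. The token ids in `song` are made up, and using GPT-2's `<|endoftext|>` id (50256) as the pad id is an assumption, since GPT-2 ships with no dedicated pad token.

```python
# Plain-Python sketch: make every tokenized song exactly block_size long,
# instead of concatenating all songs and slicing into 1024-token blocks.

BLOCK_SIZE = 1024
EOS_ID = 50256  # GPT-2's <|endoftext|> id; reused as pad id here (assumption)

def pad_or_truncate(ids, block_size=BLOCK_SIZE, pad_id=EOS_ID):
    """Return ids truncated to block_size, right-padded with pad_id."""
    ids = ids[:block_size]                            # truncate long songs
    return ids + [pad_id] * (block_size - len(ids))   # pad short ones

song = [464, 3290, 318, 257]   # hypothetical short token sequence
padded = pad_or_truncate(song)
print(len(padded))             # every example ends up the same length
```

The point is that each row of the CSV would become one fixed-length training example, so no block ever spans a song boundary.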

I also noticed that it's using <|endoftext|> as both the bos_token and the eos_token. I'm wondering how that would affect things, and whether what I'm doing is even needed, or if I should just put these tokens between my examples. From the config.json in the model: "bos_token_id": 50256, "embed_dropout": 0, "eos_token_id": 50256.
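The "just put these tokens between my examples" alternative could be sketched like this: insert the `<|endoftext|>` string between songs when writing the CSV, so even after group_texts concatenates everything, each song boundary is still marked for the model. This is an illustrative sketch, not code from the repo; the `songs` list is made up.

```python
# Sketch: delimit songs with GPT-2's shared bos/eos token at CSV-writing time,
# so group_texts' concatenation keeps an explicit boundary between examples.

EOS_TOKEN = "<|endoftext|>"  # tokenizes to id 50256 in GPT-2

songs = ["first song\nline two", "second song"]  # hypothetical data
joined = EOS_TOKEN.join(songs) + EOS_TOKEN       # trailing marker closes the last song
print(joined.count(EOS_TOKEN))                   # one marker per song
```

Since bos and eos share id 50256, a single token between songs serves as both "end of previous" and "start of next", which may make the separate padding path unnecessary.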

CupOfGeo avatar Oct 28 '21 19:10 CupOfGeo