finetune-gpt2xl
Feeding the model separate examples instead of one continuous block of text
Hello, I'm interested in adding this feature: a function in text2csv.py to take a folder of texts, and then in run_clm.py pad and truncate them instead of using the group_texts function.
I'm using songs for my data; the newline spacing is important, and I would like the songs to stay separate while fine-tuning so the end of one song isn't treated as the start of another. I have it create the CSVs so that each row is a song, but when group_texts is applied it concatenates them all and makes blocks of 1024. I'm looking into adding DataCollatorWithPadding but not having much luck at the moment.
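To make the goal concrete, here's a toy sketch of the per-example pad/truncate logic I'm after (plain Python with a made-up block size and fake token ids instead of the real tokenizer; padded positions get -100 labels so the LM loss ignores them):

```python
BLOCK_SIZE = 8   # stand-in for the real 1024
PAD_ID = 50256   # GPT-2 has no dedicated pad token, so reusing <|endoftext|> here

def pad_or_truncate(token_ids):
    """Turn one song's token ids into a fixed-length training example."""
    ids = token_ids[:BLOCK_SIZE]                   # truncate songs longer than the block
    n_pad = BLOCK_SIZE - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # mark real vs. padded positions
    labels = ids + [-100] * n_pad                  # -100 = ignored by the loss
    input_ids = ids + [PAD_ID] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

example = pad_or_truncate([10, 11, 12])
# example["input_ids"] -> [10, 11, 12, 50256, 50256, 50256, 50256, 50256]
```

This is roughly what mapping the tokenizer over each CSV row would need to produce, instead of letting group_texts merge rows together.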
I also notice that it's using <|endoftext|> as both bos_token and eos_token. I'm wondering how that would affect things, and whether what I'm doing is even needed, or if I should just put these tokens between my examples. From the model's config.json: "bos_token_id": 50256, "embed_dropout": 0, "eos_token_id": 50256.
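For comparison, if separator tokens alone turn out to be enough, the idea would just be to append the eos id (50256 per config.json) after each tokenized song before concatenation — a toy sketch with fake token-id lists standing in for tokenized songs:

```python
EOS_ID = 50256  # GPT-2's <|endoftext|>; serves as both bos and eos per config.json

songs = [[10, 11, 12], [20, 21]]  # fake token ids standing in for tokenized songs

joined = []
for ids in songs:
    joined.extend(ids)
    joined.append(EOS_ID)  # boundary marker so one song doesn't bleed into the next
# joined -> [10, 11, 12, 50256, 20, 21, 50256]
```

With the separators in place, group_texts could still chunk the stream into blocks of 1024 and the model would at least see where each song ends.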