
How to pretrain on "raw" text?

Open SinanAkkoyun opened this issue 2 years ago • 4 comments

Hi! I would like to use QLoRA to "pretrain" a model and wanted to ask if that is possible. Around the time QLoRA was released, I heard something about a 'raw' mode not existing yet.

For example, let's say I had a big dataset in the style of 'The Pile' but in another language. How could I pretrain a LLaMA model on that without constructing complete prompt-response pairs? Or is QLoRA designed only for full prompt-response pairs?

I am very much looking forward to any help!

SinanAkkoyun avatar Jul 13 '23 01:07 SinanAkkoyun

Hi, I've run into the same task. Are there any suggestions on how to approach it?

nerusskikh avatar Jul 21 '23 09:07 nerusskikh

As you point out, pretraining and finetuning are similar concepts. In fact, the way we load the Guanaco Open Assistant dataset is similar to how you would load an unlabeled dataset: just leave the input field blank and put your unlabeled data directly in the output field of the dataset. You will also need to adjust the number of tokens you accept in the source/target.
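To make this concrete, here is a minimal sketch of the dataset preparation (my own illustration, not code from this repo; `raw_corpus.txt` and the chunk size are placeholders). It slices a raw text file into chunks and writes empty-input/output-only records in the schema described above:

```python
import json

def chunk_text(text, max_words=700):
    # Crude word-based chunking; a real pipeline would chunk by
    # tokenizer tokens so each record fits the target length budget.
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

# raw_corpus.txt is a placeholder for your unlabeled corpus.
records = []
with open("raw_corpus.txt", encoding="utf-8") as f:
    for chunk in chunk_text(f.read()):
        records.append({"input": "", "output": chunk})

with open("raw_dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False)
```

Since almost all tokens then live in the output, you would also want to shift the token budget toward the target side when training (if I read qlora.py correctly, a small `--source_max_len` and a large `--target_max_len`).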

artidoro avatar Jul 21 '23 09:07 artidoro

Oh, so I could, for example, just provide data like 'The Stack' in the output only? Would that be computationally the same as randomly splitting a 'page' of data into input and output multiple times? (What I am asking is: is the input/output split computationally irrelevant, in the sense that putting unlabeled data in the output is the same as mixing and matching input and output?)

Thank you very much for your answer :)

SinanAkkoyun avatar Jul 21 '23 22:07 SinanAkkoyun

@artidoro Thank you very much for the clarification! If I understood everything correctly, we should put the raw text solely in the "output" field of the JSON. That pretty much means no system prompt and no context (input) are provided to the LLM. This is the same as plain causal language model training, though we should take care with the 'pagification' (chunking) of the data.
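If it helps, here is a minimal sketch of the masking logic as I understand it (my own illustration of how instruction-tuning collators typically work, not this repo's exact code): source tokens are usually labeled with -100 so they provide context but contribute no loss, while output tokens are trained on. With an empty input, every token carries a training signal, which is exactly plain causal LM training.

```python
import torch

IGNORE_INDEX = -100  # tokens with this label are excluded from the loss

def build_example(source_ids, target_ids, train_on_source=False):
    # Concatenate source and target; mask the source labels unless
    # train_on_source is set (flag name assumed for illustration).
    input_ids = torch.tensor(source_ids + target_ids)
    if train_on_source:
        labels = input_ids.clone()
    else:
        labels = torch.tensor([IGNORE_INDEX] * len(source_ids) + list(target_ids))
    return input_ids, labels

# With an empty source, labels equal the input ids everywhere:
ids, labels = build_example([], [5, 6, 7])
assert (ids == labels).all()
```

This would also answer the mix-and-match question above: a random input/output split is not computationally identical to output-only data, because whatever lands in the input would be masked out of the loss.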

nerusskikh avatar Jul 22 '23 16:07 nerusskikh