
Seeking guidance on continuing pretraining

Open calmitchell617 opened this issue 10 months ago • 4 comments

Hello, thanks very much for the excellent work on this repo.

There are several examples showing how to create a question-response style dataset, but I can't immediately tell how to continue pretraining with, for example, a corpus of unstructured text.

Are there any examples showing how to pack text examples and continue pretraining?

Thank you

calmitchell617 avatar Apr 19 '24 09:04 calmitchell617

Hi @calmitchell617, welcome to the repo and thanks for opening this! Sample packing and unstructured datasets for continued pretraining (CPT) have been at the top of my to-do list for some time. Do you mind sharing some examples of text corpora you might want to train with?

RdoubleA avatar Apr 19 '24 14:04 RdoubleA

@RdoubleA, thanks for the fast response.

One good example might be The Stack V1, as it is a helpful starting point when training coding assistants. I know a few people personally who would appreciate seeing an example using that dataset in particular.

I would be happy to attempt contributing a PR. I have already looked at the code quite a lot today, and will continue to work on the issue on my own anyway.

calmitchell617 avatar Apr 19 '24 14:04 calmitchell617

Yeah, that looks like a great example, and it would be good to have it in the repo. Since it's a massive dataset, you would need to add streaming support via load_dataset(..., streaming=True). If you're interested in opening a PR, I'm more than happy to take a look at it and work with you on it.
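
For anyone following along, here is a rough sketch of what streaming plus sample packing could look like. It is not torchtune's implementation: `stream_texts`, `pack_tokens`, and `toy_tokenize` are hypothetical helpers made up for illustration, and the text column name is an assumption you would need to check against the dataset card. The only real library call is Hugging Face's `datasets.load_dataset`, which spells the keyword `streaming=True`.

```python
from typing import Iterable, Iterator, List


def stream_texts(dataset_name: str, text_column: str, split: str = "train") -> Iterator[str]:
    """Yield raw text rows from a Hugging Face dataset without downloading it all.

    Hypothetical helper: `text_column` is an assumption, check the dataset card.
    """
    from datasets import load_dataset  # imported lazily; needs `pip install datasets`

    dataset = load_dataset(dataset_name, split=split, streaming=True)
    for row in dataset:
        yield row[text_column]


def pack_tokens(token_streams: Iterable[List[int]], seq_len: int, eos_id: int) -> Iterator[List[int]]:
    """Concatenate tokenized documents, separated by `eos_id`, into fixed-length blocks."""
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Leftover tokens shorter than seq_len are dropped in this sketch.


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer (character code points); swap in the model's real tokenizer."""
    return [ord(ch) for ch in text]


if __name__ == "__main__":
    # Local demo with in-memory strings, so no network access is needed.
    docs = ["def f():", "    return 1", "print(f())"]
    for block in pack_tokens((toy_tokenize(d) for d in docs), seq_len=8, eos_id=0):
        print(block)
```

To run this against a real corpus, you would feed `stream_texts(...)` through the model's tokenizer instead of the in-memory list. A fuller version would also need to decide what to do with the dropped tail and whether attention should be masked across document boundaries.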

RdoubleA avatar Apr 19 '24 15:04 RdoubleA

Great, I will make a PR in the next few days.

calmitchell617 avatar Apr 19 '24 15:04 calmitchell617

Looks like it was solved with the PR. Please feel free to reopen this issue. Thanks!

felipemello1 avatar Jun 28 '24 15:06 felipemello1