torchtune
Seeking guidance on continuing pretraining
Hello, thanks very much for the excellent work on this repo.
There are several examples showing how to create a question-response style dataset, but I can't immediately tell how to continue pretraining with, for example, a corpus of unstructured text.
Are there any examples showing how to pack text examples and continue pretraining?
Thank you
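For readers landing on this question: a minimal sketch of what "packing" usually means for continued pretraining. Each document is tokenized, the token lists are concatenated with an end-of-sequence token between them, and the stream is cut into fixed-length blocks. This is not torchtune's implementation. The function name `pack_tokens`, the toy integer token IDs, and the choice to drop the trailing partial block are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def pack_tokens(
    token_streams: Iterable[List[int]],
    seq_len: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Yield fixed-length blocks from a stream of tokenized documents.

    Hypothetical helper: documents are joined with `eos_id` so the model
    can see document boundaries, then sliced into `seq_len`-sized blocks.
    """
    buffer: List[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the end of this document
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Assumption: any leftover tokens shorter than seq_len are dropped.


# Toy "tokenized" documents; a real pipeline would call a tokenizer here.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
packed = list(pack_tokens(docs, seq_len=4, eos_id=0))
print(packed)
```

Because the function consumes an iterable and yields blocks lazily, it never needs the whole corpus in memory, which matters for large unstructured corpora.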
Hi @calmitchell617, welcome to the repo and thanks for opening this! Sample packing and unstructured datasets for continued pretraining (CPT) have been at the top of my to-do list for some time. Do you mind sharing some examples of text corpora you might want to train with?
@RdoubleA, thanks for the fast response.
One good example might be The Stack V1, as it is a helpful starting point when training coding assistants. I know a few people personally who would appreciate seeing an example using that dataset in particular.
I would be happy to attempt contributing a PR. I have already looked at the code quite a lot today, and will continue to work on the issue on my own anyway.
Yeah, that looks like a great example, and it would be good to have that in the repo. Since it's a massive dataset, you would need to add streaming support via load_dataset(..., streaming=True). If you're interested in opening up a PR, I'm more than happy to take a look at it and work with you on it.
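As a rough illustration of the streaming suggestion, here is a minimal sketch using the Hugging Face `datasets` library, where `streaming=True` returns an iterable dataset that reads rows on demand instead of downloading everything first. The helper name `stream_text` is hypothetical. The dataset ID `bigcode/the-stack`, the `content` column, and the `data_dir` value in the commented usage are assumptions to check against the dataset card, and access to that dataset may require accepting its terms and authenticating.

```python
from typing import Any, Iterator


def stream_text(
    dataset_name: str,
    text_column: str = "content",  # assumed column name; check the dataset card
    split: str = "train",
    **load_kwargs: Any,
) -> Iterator[str]:
    """Lazily yield raw text rows from a Hugging Face dataset (hypothetical helper)."""
    # Imported lazily so the module loads even where `datasets` is not installed.
    from datasets import load_dataset

    # streaming=True avoids downloading the full dataset up front.
    dataset = load_dataset(dataset_name, split=split, streaming=True, **load_kwargs)
    for row in dataset:
        yield row[text_column]


# Example usage (not executed here; needs network access and dataset permissions):
# for text in stream_text("bigcode/the-stack", data_dir="data/python"):
#     print(text[:200])
#     break
```

Each yielded string would still need to be tokenized before being packed into fixed-length training sequences.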
Great, I will make a PR in the next few days.
Looks like it was solved with the PR. Please feel free to reopen this issue. Thanks!