offsite-tuning

Usage of Pile dataset to train the emulator

Open • ziqi-zhang opened this issue 7 months ago • 1 comment

Hi,

I noticed that you trained the NLP emulator on the first 30 chunks of the Pile dataset. How large are those 30 chunks? In other words, how many chunks does the Pile have in total? The original Pile dataset is over 800 GB, which is too big for many labs...

Also, did you try smaller datasets, such as WikiText? How does the emulator perform when trained on them?

Thanks

ziqi-zhang, Dec 14 '23 03:12