
Release of data pre-processing code?

Open bwerness opened this issue 1 year ago • 1 comment

As the paper makes quite clear, careful use of open-source datasets can produce very high quality models, but it is equally clear that pre-processing that data is vital. While the pre-processing is described at a high level in the paper, the description is likely not detailed enough to reproduce the steps. Are there plans to open-source the code needed to turn the existing datasets into a high-quality corpus?

bwerness · Feb 24 '23 21:02

I have the same interest, and would also like to ask: how do you filter "low quality content" with an n-gram language model?

How do you define "good" vs. "bad" data? In other words, what metric decides how good or bad a document is?
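For context, one common approach (used in the CCNet pipeline that the paper's CommonCrawl processing is based on) is to score each document with a KenLM n-gram model trained on Wikipedia and drop or down-rank documents with high perplexity. A minimal sketch, assuming the `kenlm` Python bindings and a hypothetical model file `wikipedia.5gram.bin` (the actual model and threshold used for LLaMA are not published):

```python
import kenlm  # requires the kenlm Python bindings

# Hypothetical model file: a KenLM 5-gram model trained on Wikipedia text.
model = kenlm.Model("wikipedia.5gram.bin")

def doc_perplexity(text: str) -> float:
    """Per-token perplexity of a document under the n-gram model."""
    total_log10_prob, total_tokens = 0.0, 0
    for line in text.splitlines():
        if not line.strip():
            continue
        # score() returns log10 P(line); bos/eos add sentence-boundary tokens
        total_log10_prob += model.score(line, bos=True, eos=True)
        total_tokens += len(line.split()) + 1  # +1 for the </s> token
    return 10.0 ** (-total_log10_prob / max(total_tokens, 1))

def keep(text: str, threshold: float = 1000.0) -> bool:
    """Keep a document if its perplexity is below an assumed threshold.
    High perplexity under a Wikipedia-trained LM is the proxy for 'low quality'."""
    return doc_perplexity(text) < threshold
```

So "good" vs. "bad" is not defined directly: the metric is how surprising a document is to a language model trained on text you already trust (Wikipedia), with the threshold tuned per language. CCNet also uses this perplexity to bucket documents into head/middle/tail quality tiers rather than applying a single hard cutoff.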

tiendung · Feb 25 '23 02:02

Thanks - we will consider this for future model releases. Cheers.

jspisak · Sep 06 '23 17:09