llama
Release of data pre-processing code?
As the paper makes quite clear, proper use of open-source datasets can lead to very high-quality models, but it is equally clear that pre-processing that data is vital. While the pre-processing is described at a high level in the paper, there is likely not enough detail to replicate the steps. Are there plans to open-source the code needed to turn the existing datasets into a high-quality corpus?
I have the same interest and would like to ask: how do you filter "low quality content" with an n-gram language model?
How do you define "good" vs. "bad" data? Is there some metric that decides how good or how bad a document is?
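Not from the authors, but for anyone else wondering: the paper says CommonCrawl was processed with the CCNet pipeline, which filters low-quality content by perplexity under a KenLM n-gram model trained on a "good" reference corpus (Wikipedia). Documents that look like the reference get low perplexity and are kept; noisy or boilerplate-heavy pages get high perplexity and are dropped. A minimal sketch of that idea, with a hypothetical model path and threshold (the paper does not give the exact values):

```python
import kenlm  # Python bindings for the KenLM n-gram language model

# Hypothetical reference model and cutoff; tune both to your corpus.
LM_PATH = "wikipedia.arpa.bin"   # KenLM model trained on a high-quality reference corpus
PERPLEXITY_THRESHOLD = 500.0     # documents scoring above this are treated as low quality

model = kenlm.Model(LM_PATH)

def doc_perplexity(text: str) -> float:
    """Average per-line perplexity of a document under the reference n-gram LM."""
    lines = [l for l in text.splitlines() if l.strip()]
    if not lines:
        return float("inf")
    return sum(model.perplexity(l) for l in lines) / len(lines)

def keep(doc: str) -> bool:
    # Low perplexity = looks like the reference corpus ("good");
    # high perplexity = unlike the reference ("bad"), so filter it out.
    return doc_perplexity(doc) <= PERPLEXITY_THRESHOLD
```

This is just a sketch of the general perplexity-filtering technique, not the authors' actual code; the real pipeline also does line-level deduplication and fastText language identification before this step.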
Thanks, we will consider this for future model releases. Cheers.