llama3
List the "publicly available sources" behind Llama 3's 15T-token dataset
Llama 3 is not reproducible in any meaningful sense without a list of the dataset sources.
Please release a list of the sources.
Related question: why train only on publicly available data from the internet? If you want quality language and good knowledge, wouldn't you want to train on things like textbooks, historical documents, scientific research papers, and the like? Things that you could get in a library? I'm talking about classic, fundamental knowledge. Training on classical philosophy would probably improve reasoning skills, and training on the OG programming textbooks would be very good for programming.