List the "publicly available sources" 15T dataset list from Llama 3

Open. bennmann opened this issue 10 months ago • 1 comment

Llama 3 is not reproducible in any meaningful capacity without a list of the dataset sources.

Please release a list of the sources.

bennmann, Apr 18 '24 20:04

Related question: why train only on publicly available data from the internet? If you want quality language and good knowledge, wouldn't you want to train on things like textbooks, historical documents, and scientific research papers, the kind of material you could get in a library? I'm talking about classic, fundamental knowledge. Training on classical philosophy would probably improve reasoning skills, and training on the OG programming textbooks would be very good for programming.

grothedev, Apr 19 '24 04:04