llama
Paper question: Was there more processing on the books data than was noted?
Hi – I've been looking at the books slice of the pre-training dataset quite a bit, and I can't figure out how the original processing resulted in only 85GB of data.
The RedPajama books replication resulted in 119GB of data using just PG19, which I would expect to be a bit smaller than the most recent Gutenberg dumps.
Was there some additional quality filtering done on the books data? It would make sense, given that some of it looks rather garbled. I guess it could also be explained by a different approach to shingling generally, such as using a much smaller shingle size, or doing char-shingles rather than full-word shingles? But even then, ~34GB of data is a lot, and it doesn't look to me like RedPajama is doing anything wrong in their script.
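
To be concrete about what I mean by shingle size and char- vs word-shingles, here's a rough sketch using plain Jaccard over shingle sets. This is just for illustration of why the choice matters for how aggressive dedup ends up being; I'm not claiming this is what either the original pipeline or RedPajama actually does:

```python
def word_shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """k-grams over whitespace-split tokens (full-word shingles)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def char_shingles(text: str, k: int = 5) -> set[str]:
    """k-grams over raw characters (char shingles)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Similarity two documents would be compared on before dedup."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Smaller shingles (or char shingles) make near-duplicates look more
# similar, so more documents get dropped at the same threshold.
doc_a = "the project gutenberg ebook of pride and prejudice"
doc_b = "the project gutenberg e-book of pride and prejudice"
print(jaccard(word_shingles(doc_a), word_shingles(doc_b)))   # lower overlap
print(jaccard(char_shingles(doc_a), char_shingles(doc_b)))   # higher overlap
```

The point being: a char-shingle or small-k setup would catch many more near-duplicate Gutenberg editions than word-level shingles, which could plausibly account for a chunk of the size gap, but probably not all of it.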
Thanks, Michael
Are you doing auto input to IA learn with books?
@Apollyon81 not sure what you mean, but I'm just trying to understand the original paper here.