llama
Paper question: Was there more processing on the books data than was noted?
Hi – I've been looking at the books slice of the pre-training dataset quite a bit, and I can't figure out how the original processing resulted in only 85GB of data.
The RedPajama books replication resulted in 119GB of data using just PG19, which I would expect to be a bit smaller than the most recent Gutenberg dumps.
Was there some additional quality filtering done on the books data? It would make sense, given that some of it looks rather garbled. I guess it could also be explained by a different approach to shingling generally, such as using a much smaller shingle size, or doing char-shingles rather than full-word shingles? But even then, ~34GB of data is a lot, and it doesn't look to me like RedPajama is doing anything wrong in their script.
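
To be concrete about what I mean by shingle size and char- vs word-shingles, here's a rough sketch using plain Jaccard over shingle sets. This is just for illustration of why the choice matters for how aggressive dedup ends up being; I'm not claiming this is what either the original pipeline or RedPajama actually does:

```python
def word_shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """k-grams over whitespace-split tokens (full-word shingles)."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def char_shingles(text: str, k: int = 5) -> set[str]:
    """k-grams over raw characters (char shingles)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Similarity two documents would be compared on before dedup."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Smaller shingles (or char shingles) make near-duplicates look more
# similar, so more documents get dropped at the same threshold.
doc_a = "the project gutenberg ebook of pride and prejudice"
doc_b = "the project gutenberg e-book of pride and prejudice"
print(jaccard(word_shingles(doc_a), word_shingles(doc_b)))   # lower overlap
print(jaccard(char_shingles(doc_a), char_shingles(doc_b)))   # higher overlap
```

The point being: a char-shingle or small-k setup would catch many more near-duplicate Gutenberg editions than word-level shingles, which could plausibly account for a chunk of the size gap, but probably not all of it.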
Thanks, Michael
Are you doing auto input to IA learn with books?
@Apollyon81 not sure what you mean, but I'm just trying to understand the original paper here.