Rodney Kinney
A snapshot is divided into 1590 shards. Here's a token count for English-classified documents from a single shard of the 2019-09 snapshot:

```
$ gunzip --stdout ./0718/en_all.json.gz | jq '.raw_content'...
```
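For reference, a rough Python equivalent of that shell pipeline, assuming cc_net-style JSONL where each line is a document with a `raw_content` field (the exact counting used above may differ):

```python
import gzip
import json

# Rough whitespace-token count over one shard. Assumes each line of the
# gzipped file is a JSON document with a `raw_content` field.
total_tokens = 0
with gzip.open("./0718/en_all.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        total_tokens += len(doc.get("raw_content", "").split())
print(f"{total_tokens:,} whitespace tokens")
```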
> How do the token counts fall off when we add more snapshots? The CCNet paper asserts "There is little content overlap between monthly snapshots" without explicitly computing the drop-off....
Yes, those graphs are saying that you are left with only 30% of the content after deduplicating each line against a random 1% sample of other lines. It means that...
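A minimal sketch of that kind of sampled line-level dedup (not the actual cc_net implementation; the hashing, normalization, and sampling details here are assumptions):

```python
import hashlib
import random

def line_hash(line: str) -> bytes:
    # cc_net hashes normalized paragraphs; plain SHA-1 of the lowercased,
    # stripped line is used here as a stand-in.
    return hashlib.sha1(line.strip().lower().encode("utf-8")).digest()[:8]

def build_sample_hashes(docs, rate=0.01, seed=0):
    # Hash a random ~1% sample of all lines across the corpus.
    rng = random.Random(seed)
    seen = set()
    for doc in docs:
        for line in doc.splitlines():
            if rng.random() < rate:
                seen.add(line_hash(line))
    return seen

def dedup_doc(doc, sample_hashes):
    # Drop any line whose hash collides with the sampled set.
    kept = [l for l in doc.splitlines() if line_hash(l) not in sample_hashes]
    return "\n".join(kept)
```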
> Is it measuring by number of paragraphs removed, or number of characters?

Those are characters.
I have the pipeline tuned and running end-to-end. I've uploaded some sample data to `s3://ai2-llm/pretraining-data/sources/common-crawl/samples/2019-09`. The data is split by language. For each language, we have the option to split...
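To poke at those samples, something like the following boto3 listing works (assuming standard AWS credentials; the per-language prefix layout is inferred from the description above, not verified):

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
prefix = "pretraining-data/sources/common-crawl/samples/2019-09/"

# List the uploaded sample objects under the prefix mentioned above.
for page in paginator.paginate(Bucket="ai2-llm", Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```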
Running on a `u-3tb1` instance gives you more RAM per CPU, so the wall-clock time and dollar cost would be lower: about 17 instance-hours and $450.
Completed a run on a single snapshot to my satisfaction. I'm not uploading the full data to S3, but I'm preserving it in [this EBS snapshot](https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#SnapshotDetails:snapshotId=snap-0092999ca28c0e6e1). I will tweak the configuration and start...
Within a single dump, there is < 1% duplication by URL:

```
SELECT bucket, count(*) FROM (
  SELECT url,
    CASE
      WHEN count = 1 THEN '1'
      WHEN count < 6...
```
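The same bucketing can be reproduced locally on a sample of URLs with a counter. A sketch; the bucket boundaries beyond the visible WHEN clauses are illustrative assumptions, not taken from the full query:

```python
from collections import Counter

def url_duplication_buckets(urls):
    # Count occurrences per URL, then bucket URLs by how often they
    # repeat, mirroring the CASE expression in the Athena query above.
    counts = Counter(urls)
    buckets = Counter()
    for url, n in counts.items():
        if n == 1:
            buckets["1"] += 1
        elif n < 6:
            buckets["2-5"] += 1      # assumed boundary
        else:
            buckets["6+"] += 1       # assumed boundary
    return buckets
```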
> Within a single dump, there is < 1% duplication by URL

Athena timed out running the same query across multiple dumps.
With 3.1B unique URLs per dump, it would take about 70GB of RAM to hash them into the same data structure used by `cc_net` for paragraph-level deduping. So we could...
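For a sense of scale, a flat array of fixed-width 64-bit hashes (roughly the shape of cc_net's paragraph-dedup structure; the hash function and membership mechanics below are stand-ins) looks like this. 3.1B hashes at 8 bytes each is ~25 GB of raw hash data; the ~70 GB figure above presumably reflects the structure's per-entry overhead on top of that.

```python
import hashlib
import numpy as np

def url_hash(url: str) -> np.uint64:
    # Fixed-width 64-bit hash of the URL (SHA-1 truncated to 8 bytes);
    # a stand-in for whatever hash cc_net's structure actually uses.
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return np.frombuffer(digest[:8], dtype=np.uint64)[0]

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
hashes = np.fromiter((url_hash(u) for u in urls), dtype=np.uint64)
seen = np.unique(hashes)  # sorted unique hashes, 8 bytes per entry

# Membership check against the flat sorted array:
probe = url_hash("https://example.com/a")
idx = np.searchsorted(seen, probe)
already_seen = idx < len(seen) and seen[idx] == probe
```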