Rodney Kinney
A snapshot is divided into 1590 shards. Here's a token count for English-classified documents from a single shard of the 2019-09 snapshot:

```
$ gunzip --stdout ./0718/en_all.json.gz | jq '.raw_content'...
```
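For reference, a rough Python equivalent of that shell pipeline, assuming cc_net-style JSONL where each line is a document with a `raw_content` field (the exact counting used above may differ):

```python
import gzip
import json

# Rough whitespace-token count over one shard. Assumes each line of the
# gzipped file is a JSON document with a `raw_content` field.
total_tokens = 0
with gzip.open("./0718/en_all.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        total_tokens += len(doc.get("raw_content", "").split())
print(f"{total_tokens:,} whitespace tokens")
```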
> How do the token counts fall off when we add more snapshots? The CCNet paper asserts "There is little content overlap between monthly snapshots" without explicitly computing the drop-off....
Yes, those graphs are saying that you are left with only 30% of the content after deduplicating each line against a random 1% sample of other lines. It means that...
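A minimal sketch of that kind of sampled line-level dedup (not the actual cc_net implementation; the hashing, normalization, and sampling details here are assumptions):

```python
import hashlib
import random

def line_hash(line: str) -> bytes:
    # cc_net hashes normalized paragraphs; plain SHA-1 of the lowercased,
    # stripped line is used here as a stand-in.
    return hashlib.sha1(line.strip().lower().encode("utf-8")).digest()[:8]

def build_sample_hashes(docs, rate=0.01, seed=0):
    # Hash a random ~1% sample of all lines across the corpus.
    rng = random.Random(seed)
    seen = set()
    for doc in docs:
        for line in doc.splitlines():
            if rng.random() < rate:
                seen.add(line_hash(line))
    return seen

def dedup_doc(doc, sample_hashes):
    # Drop any line whose hash collides with the sampled set.
    kept = [l for l in doc.splitlines() if line_hash(l) not in sample_hashes]
    return "\n".join(kept)
```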
> Is it measuring by number of paragraphs removed, or number of characters?

Those are characters.
I have the pipeline tuned and running end-to-end. I've uploaded some sample data to `s3://ai2-llm/pretraining-data/sources/common-crawl/samples/2019-09`. The data is split by language. For each language, we have the option to split...
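To poke at those samples, something like the following boto3 listing works (assuming standard AWS credentials; the per-language prefix layout is inferred from the description above, not verified):

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
prefix = "pretraining-data/sources/common-crawl/samples/2019-09/"

# List the uploaded sample objects under the prefix mentioned above.
for page in paginator.paginate(Bucket="ai2-llm", Prefix=prefix):
    for obj in page.get("Contents", []):
        print(obj["Key"], obj["Size"])
```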
Running on a `u-3tb1` instance gives you more RAM per CPU, so the wall-clock time and dollar cost would be lower: about 17 instance-hours and $450.
Completed a run on a single snapshot to my satisfaction. I'm not uploading the full data to S3, but I'm preserving it in [this EBS snapshot](https://us-east-1.console.aws.amazon.com/ec2/home?region=us-east-1#SnapshotDetails:snapshotId=snap-0092999ca28c0e6e1). I will tweak the configuration and start...
Within a single dump, there is < 1% duplication by URL:

```
SELECT bucket, count(*) FROM (
  SELECT url,
    CASE
      WHEN count = 1 THEN '1'
      WHEN count < 6...
```
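The same bucketing can be reproduced locally on a sample of URLs with a counter. A sketch; the bucket boundaries beyond the visible WHEN clauses are illustrative assumptions, not taken from the full query:

```python
from collections import Counter

def url_duplication_buckets(urls):
    # Count occurrences per URL, then bucket URLs by how often they
    # repeat, mirroring the CASE expression in the Athena query above.
    counts = Counter(urls)
    buckets = Counter()
    for url, n in counts.items():
        if n == 1:
            buckets["1"] += 1
        elif n < 6:
            buckets["2-5"] += 1      # assumed boundary
        else:
            buckets["6+"] += 1       # assumed boundary
    return buckets
```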
> Within a single dump, there is < 1% duplication by URL

Athena timed out running the same query across multiple dumps.
With 3.1B unique URLs per dump, it would take about 70GB of RAM to hash them into the same data structure used by `cc_net` for paragraph-level deduping. So we could...
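For a sense of scale, a flat array of fixed-width 64-bit hashes (roughly the shape of cc_net's paragraph-dedup structure; the hash function and membership mechanics below are stand-ins) looks like this. 3.1B hashes at 8 bytes each is ~25 GB of raw hash data; the ~70 GB figure above presumably reflects the structure's per-entry overhead on top of that.

```python
import hashlib
import numpy as np

def url_hash(url: str) -> np.uint64:
    # Fixed-width 64-bit hash of the URL (SHA-1 truncated to 8 bytes);
    # a stand-in for whatever hash cc_net's structure actually uses.
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return np.frombuffer(digest[:8], dtype=np.uint64)[0]

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
hashes = np.fromiter((url_hash(u) for u in urls), dtype=np.uint64)
seen = np.unique(hashes)  # sorted unique hashes, 8 bytes per entry

# Membership check against the flat sorted array:
probe = url_hash("https://example.com/a")
idx = np.searchsorted(seen, probe)
already_seen = idx < len(seen) and seen[idx] == probe
```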