Rodney Kinney

Results 25 comments of Rodney Kinney

RedPajama's code produces raw LaTeX. Some cleaning, but un-parsed for the most part. Bibliography is discarded. UnArXive uses [tralics](http://www-sop.inria.fr/marelle/tralics/), a third-party C++ tool that translates LaTeX into XML. The unArXive...

The math processing feels like a wash to me, but the XML format seems more useful if you want to produce natural language. You also get control over what to...

Another third-party tool, [pandoc](https://pandoc.org/index.html), gives similar results: ``` Finally, in the Multi-Task Aggregation stage, the different policies are integrated into a multi-task controller that can be directed using language commands...

Using the instructions [here](https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/) to get some basic statistics on the overlap between different crawls, based on the `content_digest` field.

3.1B distinct values for `content_digest` in the latest crawl:
```
SELECT count(distinct content_digest)
FROM ai2_llm.ccindex
WHERE crawl in ('CC-MAIN-2023-06') AND subset='warc'

3128644597
```
6.4B for the last two crawls: ```...
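
For intuition, the same kind of overlap count can be sketched in Python on a tiny in-memory sample. This is only an illustration: the row layout mirrors the `crawl`, `subset`, and `content_digest` columns used in the query, while the crawl labels `CRAWL-A` and `CRAWL-B` and the digest values are made up, and the real index is far too large to handle this way.

```python
from typing import Dict, Iterable, List, Set


def distinct_digests(rows: Iterable[dict], crawl: str) -> Set[str]:
    """Distinct content_digest values among the 'warc' rows of one crawl."""
    return {
        row["content_digest"]
        for row in rows
        if row["crawl"] == crawl and row["subset"] == "warc"
    }


def overlap_stats(rows: List[dict], crawl_a: str, crawl_b: str) -> Dict[str, int]:
    """Distinct counts per crawl, for both crawls together, and shared by both."""
    a = distinct_digests(rows, crawl_a)
    b = distinct_digests(rows, crawl_b)
    return {
        "distinct_a": len(a),
        "distinct_b": len(b),
        "distinct_union": len(a | b),  # count(distinct ...) over both crawls
        "shared": len(a & b),          # digests that appear in both crawls
    }


# Hypothetical sample rows; labels and digests are placeholders.
sample_rows = [
    {"crawl": "CRAWL-A", "subset": "warc", "content_digest": "d1"},
    {"crawl": "CRAWL-A", "subset": "warc", "content_digest": "d2"},
    {"crawl": "CRAWL-B", "subset": "warc", "content_digest": "d2"},
    {"crawl": "CRAWL-B", "subset": "warc", "content_digest": "d3"},
]
print(overlap_stats(sample_rows, "CRAWL-A", "CRAWL-B"))
```

The shared count follows from the three distinct counts as `distinct_a + distinct_b - distinct_union`, so two single-crawl queries plus one combined query are enough to estimate the overlap.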

LLaMA uses a pipeline called [cc_net](https://github.com/facebookresearch/cc_net).

Steps to install CCNet on AMI `ami-0d70546e43a941d70`:
```
sudo apt install cmake
sudo apt install build-essential libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev
make install
pip install cc_net[getpy]
```

First snapshot processing failed overall, but did leave some partial output. It produces json-lines files segmented by language: ``` $ ls mined_split/2019-09/1581/ | head -10 af_all.json.gz af_all.json.gz.index als_all.json.gz als_all.json.gz.index am_all.json.gz...
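
A minimal sketch of how such partial output could be inspected, assuming each `*.json.gz` file holds one JSON object per line and that the file name prefix before `_` is the language code (as in `af_all.json.gz`). The helper names `read_jsonl_gz` and `language_of` are hypothetical and not part of cc_net.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def read_jsonl_gz(path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def language_of(path) -> str:
    """Return the file name prefix before the first underscore, e.g. 'af'."""
    return Path(path).name.split("_", 1)[0]
```

Iterating lazily keeps memory use flat, which matters when a single language file is large.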

The failure looks like [this issue](https://github.com/facebookresearch/cc_net/issues/36).

The line-level deduping does a great job cleaning up the text. Here is the original document: ``` Couple and Mother Charged in Ludlow Meth Bust What's Hot: High School Basketball...
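
To illustrate the general idea of line-level deduplication, here is a simplified sketch. It is not cc_net's actual implementation: it assumes lines are compared after lowercasing and collapsing whitespace, and that only the first occurrence of each line across all documents is kept. The function names are hypothetical.

```python
import hashlib
from typing import Iterable, List, Set


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so near-identical lines compare equal."""
    return " ".join(line.lower().split())


def dedup_lines(documents: Iterable[str]) -> List[str]:
    """Drop every line whose normalized form was already seen in any document."""
    seen: Set[bytes] = set()
    cleaned = []
    for doc in documents:
        kept = []
        for line in doc.splitlines():
            key = normalize(line)
            if not key:
                continue  # skip blank lines
            digest = hashlib.sha1(key.encode("utf-8")).digest()
            if digest in seen:
                continue  # repeated line, e.g. site navigation
            seen.add(digest)
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned
```

Storing fixed-size hashes rather than the lines themselves bounds the memory cost per unique line, which is why navigation text repeated on every page of a site is cheap to detect and remove.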