dolma
Data and tools for generating and inspecting OLMo pre-training data.
I am currently working with a dataset and noticed the term "C4 NoPunc" used in the context of data quality filtering. I would like to clarify what exactly this term...
I am trying to run paragraph-level deduplication using the dolma library and wanted to test it on C4. I downloaded `allenai/c4` from Hugging Face, updated the schema to be `text...
Hi, I encountered some problems when running `pip install dolma`. At first, the error message prompted me to install Rust. After I installed Rust and set in the...
Hi, while downloading and processing Dolma v1.7, I noticed that there are many duplicate samples with the same `id` field in the dataset. For example, in the `Project Gutenberg` source, there...
Working on creating data with Dolma v1.5-style decontamination from baseline datasets. Progress so far is commented below.
I hope you are doing well. I came across a reference to the "Web Pipeline" in the paper "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining...