dolma

Data and tools for generating and inspecting OLMo pre-training data.

Results: 22 dolma issues

I am currently working with a dataset and noticed the term "C4 NoPunc" used in the context of data quality filtering. I would like to clarify what exactly this term...

I am trying to run paragraph-level deduplication using the dolma library and wanted to test it on C4. I downloaded `allenai/c4` from Hugging Face, updated the schema to be `text...

Hi, I encountered some problems when running `pip install dolma`. At first, the error message prompted me to install Rust. After I installed Rust and set in the...

Hi, while downloading and processing Dolma v1.7, I noticed that there are many duplicate samples with the same `id` field in the dataset. For example, in the `Project Gutenberg` source, there...

Working on creating data with dolma v1.5-style decontamination from baseline datasets. Progress so far is commented below.

I hope you are doing well. I came across a reference to the "Web Pipeline" in the paper "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining...