dolma

Data and tools for generating and inspecting OLMo pre-training data.

Results 22 dolma issues

```
Traceback (most recent call last):
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 283, in _multiprocessing_run_all
    multiprocessing.set_start_method("spawn")
  File "/usr/lib/python3.11/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been...
```
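The `RuntimeError` in the traceback above is what the standard library raises when `multiprocessing.set_start_method` is called after a start method has already been fixed for the process. As a minimal sketch of one way to guard against it (not necessarily the fix adopted in dolma; the helper name `ensure_spawn_start_method` is hypothetical), the current method can be checked first and only overridden when needed:

```python
import multiprocessing


def ensure_spawn_start_method() -> str:
    """Select the "spawn" start method without raising if one is already set.

    Hypothetical helper for illustration; it is not part of dolma.
    """
    # allow_none=True returns None instead of silently fixing the default.
    current = multiprocessing.get_start_method(allow_none=True)
    if current is None:
        # Nothing has been chosen yet, so a plain call is safe.
        multiprocessing.set_start_method("spawn")
    elif current != "spawn":
        # A different method was already fixed; force=True overrides it
        # instead of raising "context has already been set".
        multiprocessing.set_start_method("spawn", force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    print(ensure_spawn_start_method())
    # Calling it again does not raise, because the method is already "spawn".
    print(ensure_spawn_start_method())
```

An alternative that avoids touching global state is `multiprocessing.get_context("spawn")`, which returns a context object whose `Pool` and `Process` use the requested start method.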

Hello everyone, I am currently using SageMaker connected to an S3 Bucket. I successfully downloaded data and obtained tagging results with Dolma without encountering any issues. However, during the final...

My single computer is not powerful enough to run Dolma :(

Even with the latest git version some of the URL taggers crash if I run the taggers with multiprocessing. I can't figure out where this race condition happens. If I...

After using the command

```
dolma tokens \
  --documents "dataset/${data_source}_add_id" \
  --tokenizer.name_or_path Qwen/Qwen1.5-7B-Chat \
  --destination dataset/${data_source}_npy \
  --tokenizer.eos_token_id 151643 \
  --tokenizer.pad_token_id 151646 \
  --dtype "uint32" \
  --processes 20
```

I use the...

Hi, thank you for sharing this outstanding repository! I have been trying to use `scripts/make_wikipedia.py` to process a German Wikipedia dump:

```
python scripts/make_wikipedia.py --output wikipedia --lang de --date 20240201...
```

@IanMagnusson asks:

> I'm trying to figure out how to mix using the dolma CLI args instead of the config. I want to do something like this but I can't...

enhancement

While running taggers on the hplt dataset, I encountered a problem that causes the `not_alphanum_paragraph_v1` tagger to stall forever. In order to debug the problem I have created a minimum working...

Add mixer configuration to trim trailing/leading whitespace from document text, and enforce a minimum document text length. Place these into a new `text_modification` config object, and move the `span_replacements` config...
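The behaviour requested above can be sketched roughly as follows. This is only an illustration of the idea, not dolma's mixer implementation: the function name and the `strip_whitespace` and `min_length` parameters are hypothetical stand-ins for whatever keys the `text_modification` config ends up using.

```python
from typing import Optional


def apply_text_modification(
    text: str,
    strip_whitespace: bool = True,  # hypothetical option name
    min_length: int = 0,            # hypothetical option name
) -> Optional[str]:
    """Trim a document's text and drop it if it ends up too short.

    Returns the (possibly trimmed) text, or None when the document
    should be discarded because it is shorter than min_length.
    """
    if strip_whitespace:
        # Remove leading and trailing whitespace only; inner whitespace is kept.
        text = text.strip()
    if len(text) < min_length:
        return None
    return text


if __name__ == "__main__":
    print(apply_text_modification("  hello world \n", min_length=5))
    print(apply_text_modification("  hi \n", min_length=5))
```

Applying the length check after trimming means that documents consisting mostly of whitespace are filtered out rather than kept on the strength of their padding.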

Currently, `bloom_filter.rs` uses ahash as the internal hasher. This is problematic because ahash has an [unstable representation](https://github.com/tkaitchuck/aHash#goals-and-non-goals):

> **different computers or computers on different versions of the code will observe...
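To illustrate why a stable hash matters here: a Bloom filter saved on one machine is only usable elsewhere if every machine maps the same item to the same bit positions. The sketch below shows that property in Python using `hashlib.blake2b`, whose output is fully specified. It is an illustration of the concept only, not the Rust code in `bloom_filter.rs` and not a proposal for which hasher dolma should adopt; `stable_bloom_indices` and its parameters are hypothetical names.

```python
import hashlib
from typing import List


def stable_bloom_indices(item: bytes, num_hashes: int, num_bits: int) -> List[int]:
    """Map an item to num_hashes bit positions in a filter of num_bits bits.

    The positions depend only on the inputs, so they are identical on
    every machine, process, and library version.
    """
    indices = []
    for seed in range(num_hashes):
        # A distinct salt per hash function gives independent-looking digests.
        digest = hashlib.blake2b(
            item, digest_size=8, salt=seed.to_bytes(8, "little")
        ).digest()
        indices.append(int.from_bytes(digest, "little") % num_bits)
    return indices


if __name__ == "__main__":
    print(stable_bloom_indices(b"some paragraph text", num_hashes=3, num_bits=1 << 20))
```

By contrast, Python's built-in `hash()` on strings and bytes is randomized per process unless `PYTHONHASHSEED` is fixed, which makes it unsuitable for a filter that is written to disk and read back elsewhere.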