dolma issues

Results 22 dolma issues


make_wikipedia.py fails on linux

```
Traceback (most recent call last):
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 283, in _multiprocessing_run_all
    multiprocessing.set_start_method("spawn")
  File "/usr/lib/python3.11/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been...
```

peterbjorgensen
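The traceback in the issue above comes from Python's standard library: `multiprocessing.set_start_method` raises `RuntimeError` when a start method has already been fixed for the process and `force=True` is not passed. Below is a minimal sketch of a guard that avoids the second call. The helper name `ensure_start_method` is hypothetical, and this is not necessarily how dolma resolves the issue.

```python
import multiprocessing


def ensure_start_method(method: str = "spawn") -> str:
    """Set the multiprocessing start method only if needed (hypothetical helper)."""
    # allow_none=True returns None when no start method has been fixed yet,
    # instead of implicitly fixing the platform default.
    current = multiprocessing.get_start_method(allow_none=True)
    if current is None:
        multiprocessing.set_start_method(method)
    elif current != method:
        # force=True overrides a previously fixed start method.
        multiprocessing.set_start_method(method, force=True)
    return multiprocessing.get_start_method()
```

Calling `ensure_start_method("spawn")` repeatedly is safe, whereas a second bare `multiprocessing.set_start_method("spawn")` raises the `RuntimeError` shown in the traceback.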

S3 mixer doesn't start

Hello everyone, I am currently using SageMaker connected to an S3 Bucket. I successfully downloaded data and obtained tagging results with Dolma without encountering any issues. However, during the final...

marcopasqua

Is there a way to integrate the Dolma toolkit with Spark?

My single computer is not powerful enough to run Dolma :(

DangoWang

Some race condition in url taggers

Even with the latest git version some of the URL taggers crash if I run the taggers with multiprocessing. I can't figure out where this race condition happens. If I...

peterbjorgensen

Data out of bounds when using ‘dolma tokens --dtype uint32’

After using the command
```
dolma tokens \
    --documents "dataset/${data_source}_add_id" \
    --tokenizer.name_or_path Qwen/Qwen1.5-7B-Chat \
    --destination dataset/${data_source}_npy \
    --tokenizer.eos_token_id 151643 \
    --tokenizer.pad_token_id 151646 \
    --dtype "uint32" \
    --processes 20
```
I use the...

Jackwaterveg
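The excerpt above is truncated, so the actual cause of the out-of-bounds data is not known here. One sanity check that follows from the command itself: the token IDs 151643 and 151646 exceed the `uint16` maximum of 65535, so the output must be read back with the same 32-bit dtype it was written with. The sketch below uses only the standard library to show what happens when `uint32` bytes are misread as `uint16`. The function names are hypothetical and unrelated to dolma's code.

```python
import struct

# Token IDs taken from the command in the issue.
EOS_TOKEN_ID = 151643
PAD_TOKEN_ID = 151646


def smallest_unsigned_dtype(max_token_id: int) -> str:
    """Return the narrowest unsigned integer type that can hold max_token_id."""
    for name, bits in (("uint8", 8), ("uint16", 16), ("uint32", 32), ("uint64", 64)):
        if max_token_id < 2 ** bits:
            return name
    raise ValueError("token id does not fit in 64 bits")


def reinterpret_as_uint16(token_ids):
    """Pack IDs as little-endian uint32, then misread the same bytes as uint16."""
    raw = struct.pack(f"<{len(token_ids)}I", *token_ids)
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


print(smallest_unsigned_dtype(PAD_TOKEN_ID))    # uint32
print(reinterpret_as_uint16([EOS_TOKEN_ID]))    # [20571, 2]
```

Each 32-bit ID turns into two unrelated 16-bit values when the dtype does not match, which is one way a reader can end up with token IDs that look wrong.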

make_wikipedia.py: long running time

Hi, thank you for sharing this outstanding repository! I have been trying to use `scripts/make_wikipedia.py` to process a German Wikipedia dump:
```
python scripts/make_wikipedia.py --output wikipedia --lang de --date 20240201...
```

chschroeder

Support providing streams into mixer via CLI

@IanMagnusson asks:
> I'm trying to figure out how to mix using the dolma CLI args instead of the config. I want to do something like this but I can't...

soldni

enhancement

not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs.

While running taggers on the hplt dataset, I encountered a problem that causes the `not_alphanum_paragraph_v1` tagger to stall forever. In order to debug the problem I have created a minimum working...

peterbjorgensen

Text modification config

Add mixer configuration to trim trailing/leading whitespace from document text, and enforce a minimum document text length. Place these into a new `text_modification` config object, and move the `span_replacements` config...

rodneykinney
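The behavior requested in the issue above (trim leading and trailing whitespace, then enforce a minimum text length) can be sketched in a few lines. This is only an illustration of the semantics. The function and parameter names are hypothetical, and the proposed `text_modification` config may use different names and live in a different part of the toolkit.

```python
from typing import Optional


def modify_text(text: str, trim_whitespace: bool = True, min_length: int = 0) -> Optional[str]:
    """Apply the two proposed modifications; return None if the document is dropped.

    All names here are assumptions for illustration, not dolma's config keys.
    """
    if trim_whitespace:
        # Remove leading and trailing whitespace only; inner whitespace is kept.
        text = text.strip()
    if len(text) < min_length:
        # Document is shorter than the minimum after trimming, so it is filtered out.
        return None
    return text
```

Trimming before the length check matters: a document consisting mostly of padding whitespace would otherwise pass the minimum-length filter.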

Change bloom_filter implementation of hash

Currently, bloom_filter.rs uses ahash as its internal hasher. This is problematic since ahash has an [unstable representation](https://github.com/tkaitchuck/aHash#goals-and-non-goals):
> **different computers or computers on different versions of the code will observe...

chris-ha458
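To illustrate why a stable hash matters for a bloom filter that is saved and reloaded, here is a minimal Python sketch that derives bit positions from a fixed, specified digest (BLAKE2b from the standard library). It is not the Rust implementation in `bloom_filter.rs`, and the choice of BLAKE2b and the function name `bloom_indices` are assumptions made for the example.

```python
import hashlib


def bloom_indices(item: bytes, num_hashes: int, num_bits: int) -> list[int]:
    """Derive num_hashes bit positions for item from a stable digest (illustrative)."""
    indices = []
    for seed in range(num_hashes):
        # A distinct salt per hash function gives independent-looking digests;
        # BLAKE2b output is fully specified, so it is identical on every machine.
        digest = hashlib.blake2b(
            item, digest_size=8, salt=seed.to_bytes(8, "little")
        ).digest()
        indices.append(int.from_bytes(digest, "little") % num_bits)
    return indices
```

Because the positions depend only on the input bytes, the salt, and the filter size, a filter written on one machine can be queried on another. A hasher whose output varies across machines or versions cannot give that guarantee.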
