dolma
Data and tools for generating and inspecting OLMo pre-training data.
I have been looking into https://github.com/allenai/dolma/blob/main/src/bloom_filter.rs Specifically how it was thread-safe ``` pub fn contains_hashes(&self, hashes: &Vec) -> bool { for hash in hashes { let hash = *hash as...
I would really love a proper contributing.md document, styleguide, and precommit. _Originally posted by @chris-ha458 in https://github.com/allenai/dolma/issues/23#issuecomment-1685029747_
Hi, dolma is a wonderful tool, and I'm successfully using it for many steps of my pipeline. Strangely, I can't manage to get it working for (paragraph-level) deduplication. When...
Hi everyone, I'm working on a research project relating to instruction following, and it would be amazing to have a language model with a guarantee that no _explicitly_ instruction-following data...
This PR adds three nice features to `BaseParallelProcessor`:
- Refactors progress bar out of `parallel.py`
- Adds a `PoolWithDebug` wrapper around `multiprocessing.Pool` that transparently disables multiprocessing when debugging
- Uses...
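The idea of a pool wrapper that falls back to serial execution while debugging can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class name `DebugFriendlyPool`, its `debug` flag, and the `square` helper are hypothetical, and the real `PoolWithDebug` may detect debugging differently and expose more of the `multiprocessing.Pool` interface.

```python
import multiprocessing
from typing import Any, Callable, Iterable, List, Optional


class DebugFriendlyPool:
    """Hypothetical sketch of a pool that runs serially when debug=True.

    This is NOT dolma's PoolWithDebug; names and behavior are assumptions.
    """

    def __init__(self, processes: Optional[int] = None, debug: bool = False):
        self.debug = debug
        # In debug mode no worker processes are created, so breakpoints and
        # tracebacks stay in the main process.
        self._pool = None if debug else multiprocessing.Pool(processes=processes)

    def map(self, fn: Callable[[Any], Any], items: Iterable[Any]) -> List[Any]:
        if self._pool is None:
            # Serial fallback: same results, same order, no subprocesses.
            return [fn(item) for item in items]
        return self._pool.map(fn, items)

    def close(self) -> None:
        if self._pool is not None:
            self._pool.close()
            self._pool.join()

    def __enter__(self) -> "DebugFriendlyPool":
        return self

    def __exit__(self, *exc_info: Any) -> None:
        self.close()


def square(x: int) -> int:
    """Toy work function used only for illustration."""
    return x * x


if __name__ == "__main__":
    with DebugFriendlyPool(debug=True) as pool:
        print(pool.map(square, [1, 2, 3]))
```

Because the caller uses the same `map` call in both modes, switching the flag is the only change needed to step through worker code in a debugger.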
#96 has already been mentioned and my version is tag 1.0.3, my command is : dolma dedupe --documents "study/samples/v0/documents//*" --dedupe.documents.attribute_name 'bff_duplicate_documents' --dedupe.documents.key "metadata.id" --dedupe.skip_empty --bloom_filter.file /tmp/deduper_bloom_filter.bin --no-bloom_filter.read_only --bloom_filter.estimated_doc_count '6_000_000' --bloom_filter.desired_false_positive_rate...
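For context on the `--bloom_filter.estimated_doc_count` and `--bloom_filter.desired_false_positive_rate` options in the command above, the textbook Bloom filter sizing formulas relate those two numbers to the filter's bit count and number of hash functions. The sketch below uses those standard formulas; whether dolma sizes its filter exactly this way is an assumption, the function name `bloom_filter_size` is hypothetical, and because the false positive rate is truncated in the command, the value `0.01` is only a placeholder.

```python
import math
from typing import Tuple


def bloom_filter_size(n_items: int, fp_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) from the standard Bloom filter formulas.

    bits       m = -n * ln(p) / (ln 2)^2
    hash_count k = (m / n) * ln 2
    Hypothetical helper for illustration; not part of the dolma API.
    """
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    if not 0.0 < fp_rate < 1.0:
        raise ValueError("fp_rate must be strictly between 0 and 1")
    bits = math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))
    hash_count = max(1, round(bits / n_items * math.log(2)))
    return bits, hash_count


if __name__ == "__main__":
    # 6_000_000 comes from the command above; 0.01 is an assumed placeholder.
    bits, hash_count = bloom_filter_size(6_000_000, 0.01)
    print(f"{bits / 8 / 1024 / 1024:.1f} MiB, {hash_count} hash functions")
```

If the actual document count far exceeds the estimate, the observed false positive rate rises above the requested one, which is worth checking when deduplication flags unexpected documents.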
Dear authors, I tried to implement the rule on page 57 of your Dolma paper 'Remove documents with more than half of their line not ending in...'. And I modified...
Dear authors, I was trying to reimplement the Dolma-Web described in your paper. However, in Step 2, using the dolma toolkit, I found the Gopher implementation in this repo something...
Hi! Is it possible to cut a new version to PyPI? The current one installs all the optional dependencies and some of them have specific build requirements (e.g. `LTpycld2` requires...
General updates to the `dedupe` command to do deduplication using a joint paragraph/document flow in the same way that DCLM does. Nuanced update list: Bloom Filter updates: - Used better...