dolma
Simplify how rules in the mixer are provided
Because of the strange requirement on how to specify logical filter rules for fields that do not exist in all documents, I looked into this behaviour, and it turned out to be a bug in jsonpath-rust, which has since been fixed. You may want to update to a newer version of jsonpath-rust to pick up this fix: https://github.com/besok/jsonpath-rust/pull/47
Dolma still depends on a broken version of jsonpath-rust (0.3.0 or older). The bugfix mentioned above is included in the latest releases; I think the oldest version that includes the fix is 0.3.3. I would recommend bumping the version in Cargo.toml. The latest version is 0.4.0.
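Concretely, the change would just be a one-line bump in Cargo.toml, something like the following (the real entry may differ if features or a more precise version are pinned):

# Cargo.toml (sketch of the suggested dependency bump)
[dependencies]
jsonpath-rust = "0.4.0"   # was 0.3.0; any release >= 0.3.3 contains the fix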
This is nice; I will bump the dependency in the next version, @peterbjorgensen! In the meantime, I recently added support for specifying rules using jq syntax. It is not the default, but it can be enabled by setting syntax: jq in the filter config, e.g.:
streams:
  - name: falcon
    documents:
      - s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v0/documents/*
    attributes:
      - dedupe_para_ngrams_13_1
      - pii_regex_with_counts_fast_v2
      - tokenizer_repetitions_v2r2
    output:
      max_size_in_bytes: 3_814_697_265
      path: s3://ai2-llm/pretraining-data/sources/falcon-refinedweb/v1/documents
      min_text_length: 25
      discard_fields:
        - attributes
    filter:
      include:
        # computes average duplication factor and only keeps docs with less than 30% duplication
        - >-
          (.attributes.dedupe_para_ngrams_13_1 | length == 0) or
          ((.attributes.dedupe_para_ngrams_13_1 | map(.[2] * (.[1] - .[0])) | add) / (.text | length) <= 0.3)
      exclude:
        # Remove documents with more than 10 repeated ngrams
        - >-
          (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition != null) and
          (.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition[0][-1] > 10)
        # PII filter
        - .attributes.pii_regex_with_counts_fast_v2__pii_regex_with_counts_fast_v2__doc_count[0][-1] > 5
      syntax: jq

processes: 188
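Rules written this way can also be sanity-checked outside the mixer with the jq CLI. As a sketch, feeding the include rule a made-up document whose only duplicate span is [0, 5, 1.0] (i.e. half of a 10-character text flagged as duplicated) shows the rule rejecting it:

echo '{"text": "aaaaaaaaaa", "attributes": {"dedupe_para_ngrams_13_1": [[0, 5, 1.0]]}}' \
  | jq '(.attributes.dedupe_para_ngrams_13_1 | length == 0) or
        ((.attributes.dedupe_para_ngrams_13_1 | map(.[2] * (.[1] - .[0])) | add) / (.text | length) <= 0.3)'
# prints false: 5 of 10 characters (50%) are covered by duplicate spans, above the 30% cutoff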
Cool! Isn't there a .attributes missing in the exclude filters in the example, i.e. shouldn't it be .attributes.tokenizer_repetitions_v2r2__tokenizer_repetitions_v2r2__doc_max_score_repetition?
Is it also possible to filter on document metadata, such as .metadata.sub-source == "mygoodsource"?