text-dedup
text-dedup copied to clipboard
All-in-one text de-duplication
text-dedup
Features
- Hash-based methods such as SimHash, MinHash + LSH for near deduplication.
- SuffixArray-based method from Deduplicating Training Data Makes Language Models Better for substring exact deduplication.
- In-memory or Redis/KeyDB-cached index to handle larger than memory datasets.
Documentation
CLI Usage
cli.py
is a wrapper tool that identifies duplicates for a given Huggingface's dataset. Currently, only hash-based methods will try to identify all duplicates within the dataset and the suffix array method will only find the duplicate substrings within dataset splits.
By default, the tool uses redis as a cache layer for the hashes. See configs/method/minhash.yaml
or configs/method/simhash.yaml
for details. Or you can overwrite the storage_config
to null
to use in-memory index. Deduplicating small datasets that fit in your machine's memory should be fine with in-memory index.
python cli.py method=suffix method.dataset=oscar-corpus/OSCAR-2201 method.configs="[gl]"
python cli.py method=simhash method.tokenization.ngram_size=12 method.dataset=oscar-corpus/OSCAR-2201 method.configs="[gl]"
python cli.py method=minhash method.tokenization.ngram_size=12 method.dataset=oscar-corpus/OSCAR-2201 method.configs="[gl]"
- Configurations are parsed with hydra.
Programmatic Usage
Hash-based Near Deduplication
from text_dedup.embedders.minhash import MinHashEmbedder
from text_dedup.utils.nn import lsh_clustering
from text_dedup.utils.group import get_group_indices
corpus = [
"The quick brown fox jumps over the lazy dog",
"The quick brown fox jumps over the lazy dog",
"This is a test",
"This is a test",
]
embedder = MinHashEmbedder()
embeddings = embedder.embed(corpus)
clusters = lsh_clustering(embeddings)
groups = get_group_indices(clusters)
print(groups)
# [0, 0, 2, 2]
from text_dedup.embedders.simhash import SimHashEmbedder
from text_dedup.utils.nn import simhash_clustering
from text_dedup.utils.group import get_group_indices
corpus = [
"The quick brown fox jumps over the lazy dog",
"The quick brown fox jumps over the lazy dog",
"This is a test",
"This is a test",
]
embedder = SimHashEmbedder()
embeddings = embedder.embed(corpus)
clusters = simhash_clustering(embeddings)
groups = get_group_indices(clusters)
print(groups)
# [0, 0, 2, 2]
Suffix Array Substring Exact Deduplication
from text_dedup.embedders.suffix import SuffixArrayEmbedder
corpus = [
"The quick brown fox jumps over the lazy dog",
"The quick brown fox jumps over the lazy dog",
"This is a test",
"This is a test",
"This is a random test",
"The quick brown fox and a random test"
]
embedder = SuffixArrayEmbedder(k=10)
slices = embedder.embed(corpus, merge=True, merge_strategy='longest')
# or using the original rust code
# slices = embedder.embed_bash(corpus)
for sentence, intervals in zip(corpus, slices):
print(sentence)
print([sentence[slice] for slice in intervals])
# The quick brown fox jumps over the lazy dog
# ['The quick brown fox jumps over the lazy dog']
# The quick brown fox jumps over the lazy dog
# ['The quick brown fox jumps over the lazy dog']
# This is a test
# ['This is a test']
# This is a test
# ['This is a test']
# This is a random test
# ['This is a ', ' a random test']
# The quick brown fox and a random test
# ['The quick brown fox ', ' a random test']
Transformer Embedding Semantic Deduplication
from text_dedup.embedders.transformer import TransformerEmbedder
from text_dedup.utils.nn import annoy_clustering
from text_dedup.utils.group import get_group_indices
from transformers import AutoTokenizer, AutoModelForSequenceClassification
corpus = [
"The quick brown fox jumps over the dog",
"The quick brown fox jumps over the corgi",
"This is a test",
"This is a test message",
]
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
embedder = TransformerEmbedder(tokenizer, model)
embeddings = embedder.embed(corpus)
clusters = annoy_clustering(embeddings, f=768)
groups = get_group_indices(clusters)
print(groups)
# [0, 0, 2, 2]
Best Fuzzy Search
This is useful for ad-hoc fuzzy substring search. Given a long document and a query string, this function will return a best fuzzy match based on Jaccard similarity.
from text_dedup.utils.search import best_fuzzy_search
best_fuzzy_search("Hello world!", "Random word, Hello word! hello menudo!")
# (13, 'Hello word!')
Benchmarks
Todos
- [ ] Wrap suffix array inter-split deduplication
- [ ] Wrap inter-dataset deduplication
- [ ] Rewrite suffix array in Python
Thanks
This project is heavily influenced by the deduplication work at BigScience workshop. The original code can be found at bigscience-workshop/data-preparation.