openwebtext
Idea for further filtering
I've just run a quick filter to find non-English docs and found 5,052 such cases (out of roughly 8 million in total).
It's a fairly crude filter, but I haven't seen any false positives:
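import re
import datasets

ds = datasets.load_dataset("openwebtext", split="train")
# Keep only docs containing none of these common English words
# (case-insensitive substring match, so "and" also matches inside "android").
ds_filtered = ds.filter(lambda sample: not re.search("(?i)the|that|and|with|this", sample["text"]))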
Samples of the docs are things like this:

Printed with
for doc in ds_filtered:
    # Flatten newlines and show only the first 400 characters of each doc
    print(doc["text"].replace("\n", " | ")[:400])
    print("\n")
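One possible refinement, offered only as a sketch: the pattern above matches substrings, so a non-English doc containing a word like "Thema" or "sandwich" still counts as English and slips through. The variant below matches the same five words only as whole words. The names `STOPWORD_RE` and `looks_non_english` are hypothetical, and this version has not been run against the full dataset here, so its count would differ from the 5,052 above and it may let in some false positives (for example, very short English docs).

```python
import re

# Hypothetical variant of the filter above: same five words,
# but matched only as whole words (\b = word boundary), case-insensitively.
STOPWORD_RE = re.compile(r"\b(?:the|that|and|with|this)\b", re.IGNORECASE)


def looks_non_english(text: str) -> bool:
    """Return True when none of the five common English words occur as whole words."""
    return STOPWORD_RE.search(text) is None


# Assumed usage with the dataset loaded earlier:
# ds_filtered = ds.filter(lambda sample: looks_non_english(sample["text"]))
```

Because whole-word matching is stricter, this flags at least every doc the substring version flags, so the output would need the same manual spot check.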
Feel free to close if you have no plans for future versions of the dataset, just thought you might like to know.