Idea for further filtering

Open davidgilbertson opened this issue 2 years ago • 0 comments

I've just run a quick filter to find non-English docs and found 5,052 such cases (of the total 8 million).

It's a fairly crude filter (it flags any doc that contains none of a handful of common English words), but I haven't seen any false positives.

import re
import datasets

ds = datasets.load_dataset("openwebtext", split="train")
# Keep only docs containing none of these common English words (case-insensitive substring match)
ds_filtered = ds.filter(lambda sample: not re.search("(?i)the|that|and|with|this", sample["text"]))
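
The predicate inside the lambda above can be pulled out and checked on a few in-memory strings before running it over all 8 million docs. This is only a minimal sketch: the function name `looks_non_english` and the sample strings are made up for illustration, and the pattern is the same one used in the filter.

```python
import re

# Same pattern as the filter above: a few common English words, case-insensitive.
# These are substring matches, so "and" also matches inside "Anderson".
COMMON_ENGLISH = re.compile(r"the|that|and|with|this", re.IGNORECASE)


def looks_non_english(text: str) -> bool:
    """Return True when none of the common English words appear in text."""
    return COMMON_ENGLISH.search(text) is None


# Hypothetical sample strings, not taken from the dataset.
samples = [
    "The cat sat with the dog.",
    "Der Hund und die Katze schlafen.",
    "Anderson kommt morgen.",
]
flags = [looks_non_english(s) for s in samples]
print(flags)
```

Because the match is on substrings rather than whole words, the filter errs towards treating a doc as English, which may be part of why false positives are rare. The same function could be passed to the filter as `ds.filter(lambda sample: looks_non_english(sample["text"]))`.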

Samples of the filtered docs look like this:

[image]

Printed with:

for doc in ds_filtered:
    print(doc["text"].replace("\n", " | ")[:400])
    print("\n")

Feel free to close if you have no plans for future versions of the dataset, just thought you might like to know.

davidgilbertson, Mar 05 '23 01:03