data-preparation
Why stopwords_min_cutoff rather than stopwords_max_cutoff?
Thanks for your helpful codebase!
I am a bit confused about the stop words filtering.
The released code removes a document if its stop words ratio is below a certain cutoff:
https://github.com/bigscience-workshop/data-preparation/blob/9d0588419073cc5bf0fb92b58f37f2a1016572c3/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py#L590
But the notebook, in section 2.5, states the opposite: "If the stop words ratio for a document is higher than a certain cutoff, it is removed."
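To make the difference concrete, here is a minimal sketch of the two behaviours as I understand them. The function names, the whitespace tokenization, the tiny stop word set, and the handling of a ratio exactly equal to the cutoff are all my own assumptions for illustration, not the repository's actual implementation.

```python
def stopword_ratio(text, stopwords):
    """Fraction of whitespace-separated words that are stop words."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)


def keep_with_min_cutoff(text, stopwords, cutoff):
    # How I read the released code: remove the document when the ratio
    # is BELOW the cutoff (boundary handling is my assumption).
    return stopword_ratio(text, stopwords) >= cutoff


def keep_with_max_cutoff(text, stopwords, cutoff):
    # How I read notebook section 2.5: remove the document when the ratio
    # is HIGHER than the cutoff.
    return stopword_ratio(text, stopwords) <= cutoff


# Hypothetical stop word list and cutoff, for illustration only.
STOPWORDS = {"the", "a", "of", "is", "on"}
doc = "the cat is on the mat"  # 4 of 6 words are stop words

print(keep_with_min_cutoff(doc, STOPWORDS, 0.3))  # kept under a minimum cutoff
print(keep_with_max_cutoff(doc, STOPWORDS, 0.3))  # removed under a maximum cutoff
```

With the same cutoff, the two rules make opposite decisions for this document, which is why I would like to confirm which one is intended.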
I am wondering which one is more useful in your practice. Thanks in advance!