
Why stopwords_min_cutoff rather than stopwords_max_cutoff?

Open longxudou opened this issue 2 months ago • 0 comments

Thanks for your helpful codebase!

I am a bit confused about the stop-word filtering. The released code removes a document if its stop-word ratio is below a certain cutoff: https://github.com/bigscience-workshop/data-preparation/blob/9d0588419073cc5bf0fb92b58f37f2a1016572c3/preprocessing/training/01b_oscar_cleaning_and_filtering/filtering.py#L590 But in the notebook, section 2.5 states: "If the stop words ratio for a document is higher than a certain cutoff, it is removed."
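To make the two readings concrete, here is a minimal sketch of how I understand them. This is not the code from filtering.py: the function names, the toy stop-word set, and the cutoff value are all assumptions made up for illustration.

```python
# Hypothetical illustration only; none of these names come from the repository.
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in"}  # assumed toy stop-word list


def stopwords_ratio(document: str) -> float:
    """Fraction of whitespace-separated words that are stop words."""
    words = document.lower().split()
    if not words:
        return 0.0
    return sum(word in STOPWORDS for word in words) / len(words)


def keep_with_min_cutoff(document: str, cutoff: float) -> bool:
    """Reading 1 (released code, as I read it): remove if the ratio is BELOW the cutoff."""
    return stopwords_ratio(document) >= cutoff


def keep_with_max_cutoff(document: str, cutoff: float) -> bool:
    """Reading 2 (notebook wording): remove if the ratio is ABOVE the cutoff."""
    return stopwords_ratio(document) <= cutoff


prose = "the cat is in the house"      # 4 of 6 words are stop words
keywords = "buy cheap watches online"  # no stop words at all
cutoff = 0.3                           # assumed value, not the project's setting

print(keep_with_min_cutoff(prose, cutoff), keep_with_min_cutoff(keywords, cutoff))
print(keep_with_max_cutoff(prose, cutoff), keep_with_max_cutoff(keywords, cutoff))
```

In this sketch the first reading keeps the ordinary sentence and drops the keyword list, while the second reading does the reverse, so the two descriptions filter in opposite directions.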

Which of the two behaviours is intended, and which did you find more useful in practice? Thanks in advance!

longxudou · Apr 18 '24 10:04