data-juicer: Why does stopwords_filter filter out samples below a threshold?

Open noforit opened this issue 1 year ago • 2 comments

Before Asking

[X] I have read the README carefully.
[X] I have pulled the latest code of main branch to run again and the problem still existed.

Search before asking

[X] I have searched the Data-Juicer issues and found no similar questions.

Question

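One thing confuses me: shouldn't a sample with a higher proportion of stopwords be the less meaningful one? Why does this filter remove samples whose ratio is below a threshold? Could you please explain 😂, or point out where my understanding is off?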

Additional

No response

Apr 25 '24 08:04 noforit

Hi @noforit, thanks for your interest in and use of data-juicer!

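stopwords_filter was originally designed to use the stopword ratio to screen out text that has been processed by a search engine. Typically, to improve search efficiency among other reasons, a search engine removes the stopwords from a document before building its index. Removing the stopwords damages the document's semantic information, so such text is considered relatively low quality as LLM training data. That is why this operator filters out samples whose stopword ratio is low.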
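To make the direction of the threshold concrete, here is a minimal Python sketch of a lower-bound ratio filter. It is not Data-Juicer's implementation: the toy word list, the whitespace tokenization, and the names `stopword_ratio`, `keep_sample`, and `min_ratio` are assumptions made for illustration only.

```python
# Illustrative sketch only; not Data-Juicer's actual code.
# Toy stopword list (assumption); a real list would be much larger.
STOPWORDS = {"the", "a", "of", "is", "in", "to", "and"}


def stopword_ratio(text, stopwords=STOPWORDS):
    """Return the fraction of whitespace-separated tokens that are stopwords."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in stopwords for token in tokens) / len(tokens)


def keep_sample(text, min_ratio, stopwords=STOPWORDS):
    """Keep a sample only if its stopword ratio reaches the lower bound."""
    return stopword_ratio(text, stopwords) >= min_ratio


# Natural prose contains many stopwords.
natural = "the cat is in the garden and the dog is asleep"
# The same content after stopword removal, as an indexer might store it.
stripped = "cat garden dog asleep"

for sample in (natural, stripped):
    print(sample, "->", "keep" if keep_sample(sample, min_ratio=0.3) else "drop")
```

With a lower bound like this, the natural sentence passes and the stopword-stripped one is dropped, which matches the intent described above.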

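But what you said is also correct: samples with a high stopword ratio are low quality as well. stopwords_filter actually has a functionally complementary operator called flagged_words_filter, which is intended to remove samples whose ratio of flagged (sensitive) words is too high. Both operators accept a custom word list, so more generally they can be used to remove samples in which the ratio of some class of words you care about is too high or too low. For the case you describe, we can add a flagged_words_filter with its word list set to the stopword list, and samples whose stopword ratio is too high will then be filtered out as well.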
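The combined effect of the two operators can be pictured as keeping only samples whose word ratio falls inside a band. The Python sketch below is a rough illustration under stated assumptions, not Data-Juicer's code: the word list, the whitespace tokenization, the inclusive comparisons, and the names `word_ratio`, `keep_in_band`, `min_ratio`, and `max_ratio` are all hypothetical here.

```python
# Illustrative sketch only; not Data-Juicer's actual code.
# Toy word list (assumption), shared by both bounds.
WORD_LIST = {"the", "a", "of", "is", "in", "to", "and"}


def word_ratio(text, vocab=WORD_LIST):
    """Return the fraction of whitespace-separated tokens found in vocab."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)


def keep_in_band(text, min_ratio, max_ratio, vocab=WORD_LIST):
    """Keep a sample only if its ratio is neither too low nor too high.

    The lower bound plays the role described for stopwords_filter,
    the upper bound the role described for flagged_words_filter.
    """
    ratio = word_ratio(text, vocab)
    return min_ratio <= ratio <= max_ratio


samples = [
    "cat garden dog asleep",                            # stopwords stripped
    "the cat is in the garden and the dog is asleep",   # natural prose
    "the of the and a to the is",                       # almost only stopwords
]

kept = [s for s in samples if keep_in_band(s, min_ratio=0.3, max_ratio=0.8)]
print(kept)
```

Only the natural sentence survives both bounds; the thresholds 0.3 and 0.8 are arbitrary values picked for this example.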

If you have any further questions, feel free to reach out to us anytime~

Apr 25 '24 12:04 HYLcool

@HYLcool OK, thank you!

Apr 25 '24 12:04 noforit

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.

May 17 '24 09:05 github-actions[bot]

Close this stale issue.

May 21 '24 09:05 github-actions[bot]
