data-juicer: Why do most of the refined recipes use simhash for deduplication?


Open sherrytonger opened this issue 1 year ago • 1 comment

Before Asking

[X] I have read the README carefully.
[X] I have pulled the latest code of the main branch and run it again, and the problem still exists.

Search before asking

[X] I have searched the Data-Juicer issues and found no similar questions.

Question

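As the title says: most industry practice uses minhash for deduplication, for example the Pile and The Stack, so why does data-juicer recommend simhash? Is it for computational efficiency (on a 1G dataset, single-threaded, simhash takes roughly half as long as minhash), or because deduplicating with simhash leads to better downstream task performance?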

Additional

No response

Apr 10 '24 14:04 sherrytonger

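Hi @sherrytonger,

Thanks for your interest in and use of data-juicer!

In the data-juicer recipes we released early on, we did indeed use simhash for deduplication almost everywhere. The main reason is the efficiency advantage you mentioned. At that time, data-juicer was aimed mainly at ordinary users, who can usually only process data on a single machine, so when handling larger datasets, simhash's speed and lower resource usage were its advantages. Simhash deduplication was also what we supported first in the early days.

Of course, we have since added single-machine minhash deduplication as well as distributed deduplication, so users can pick a suitable deduplication algorithm based on their available resources and their deduplication quality requirements.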
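To make the efficiency trade-off above concrete, here is a minimal, self-contained sketch of the two fingerprinting schemes in plain Python. It is not data-juicer's implementation: the helper names (`shingles`, `hash64`, `simhash`, `minhash`), the word 3-gram shingling, the 64-bit fingerprint width, and the 128 hash seeds are all assumptions chosen for illustration. What it shows is that simhash hashes each shingle once and keeps a single integer per document, while minhash hashes each shingle once per seed and keeps one value per seed, which is one plausible source of the runtime and memory gap discussed in this thread.

```python
import hashlib
import re


def shingles(text, n=3):
    """Return the set of lowercase word n-grams of a text (illustrative tokenization)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def hash64(token, seed=0):
    """Seeded 64-bit hash built on BLAKE2b from the standard library."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def simhash(text, bits=64):
    """One hash per shingle; the result is a single `bits`-wide fingerprint."""
    counts = [0] * bits
    for s in shingles(text):
        h = hash64(s)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when more shingles voted 1 than 0.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of differing bits; small distances suggest near-duplicates."""
    return bin(a ^ b).count("1")


def minhash(text, num_perm=128):
    """`num_perm` hashes per shingle; the signature keeps one minimum per seed."""
    sh = shingles(text)
    if not sh:
        return [0] * num_perm  # assumption: empty documents share a dummy signature
    return [min(hash64(s, seed) for s in sh) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog near the river bank"
    b = "the quick brown fox jumps over the lazy dog near the river shore"
    print("simhash Hamming distance:", hamming_distance(simhash(a), simhash(b)))
    print("minhash Jaccard estimate:", estimated_jaccard(minhash(a), minhash(b)))
```

In this sketch the minhash signature costs `num_perm` times as many hash evaluations and `num_perm` stored values per document, in exchange for a direct estimate of Jaccard similarity. Real implementations vectorize this work and add blocking or banding to find candidate pairs, so actual speed ratios will differ from what this toy code suggests.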

Apr 16 '24 06:04 HYLcool

This issue is marked as stale because there has been no activity for 21 days. Remove the stale label or add new comments, or this issue will be closed in 3 days.

May 07 '24 09:05 github-actions[bot]

Close this stale issue.

May 10 '24 09:05 github-actions[bot]
