Please add tutorial notebooks that demonstrate Exact deduplication and Fuzzy deduplication
Is your feature request related to a problem? Please describe.
The current tutorial notebook only covers Semantic Deduplication. The documentation suggests using the TextDuplicatesRemovalWorkflow for removing duplicates, but the information is insufficient and I cannot figure out how to use it.
Describe the solution you'd like
Please add tutorial notebooks that demonstrate Exact deduplication and Fuzzy deduplication. This would help users understand how to apply these methods.
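For readers unfamiliar with the two methods named in this request, here is a minimal plain-Python sketch of what exact and fuzzy deduplication do conceptually. It is not the library's API and does not use TextDuplicatesRemovalWorkflow; every function name below (`exact_dedup`, `shingles`, `jaccard`, `fuzzy_dedup`) and the default threshold of 0.8 are hypothetical, chosen only for illustration. The fuzzy part compares word n-gram sets pairwise with Jaccard similarity, which is a simplification; production systems typically approximate this with MinHash and locality-sensitive hashing to avoid the quadratic comparison.

```python
import hashlib
import re


def exact_dedup(docs):
    """Keep the first document seen for each distinct content hash."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text, n=3):
    """Return the set of word n-grams of a lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Texts shorter than n words become a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def fuzzy_dedup(docs, threshold=0.8, n=3):
    """Drop any document whose shingle set is at least `threshold`
    similar to a document that was already kept (O(n^2) comparison)."""
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc, n)
        if all(jaccard(s, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

On a small list, `exact_dedup` removes only byte-identical copies, while `fuzzy_dedup` also removes documents that differ only in casing or punctuation, since both are discarded when the shingles are built.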
Thanks for opening. The tutorials are something we're working on and will open a PR for soon.
The documentation suggests using the TextDuplicatesRemovalWorkflow for removing duplicates, but the information is insufficient and I cannot figure out how to use it.
Are you unsure about this in the context of Fuzzy/Exact deduplication or do you have any questions about using this in Semantic Deduplication as well?
Thank you very much for your reply. The upcoming PR sounds great — I really appreciate the update.
I’m currently unsure about how to use it for Fuzzy/Exact deduplication, but I was able to understand the usage for Semantic Deduplication, as the tutorial notebooks explain it clearly.
This may be related to issue #1216.
I'm not sure this is just a matter of documentation. I've spent two days trying to get the fuzzy dedup workflow to succeed, using the example straight from the existing docs and NVIDIA's own docker image. It fails immediately, for what appear to be fundamental reasons. It's almost as if the software is completely broken.
Started adding this in #1242. Will follow up with Exact in a different PR.