Please add tutorial notebooks that demonstrate Exact deduplication and Fuzzy deduplication

Open popiemon opened this issue 2 months ago • 3 comments

Is your feature request related to a problem? Please describe.

The current tutorial notebook only covers Semantic Deduplication. The documentation suggests using the TextDuplicatesRemovalWorkflow for removing duplicates, but the information is insufficient and I cannot figure out how to use it.

Describe the solution you'd like

Please add tutorial notebooks that demonstrate Exact deduplication and Fuzzy deduplication. This would help users understand how to apply these methods.

Describe alternatives you've considered

Additional context

popiemon • Nov 06 '25 02:11

Thanks for opening. We're working on the tutorials and will open a PR for them soon.

> The documentation suggests using the TextDuplicatesRemovalWorkflow for removing duplicates, but the information is insufficient and I cannot figure out how to use it.

Are you unsure about this only in the context of Fuzzy/Exact deduplication, or do you also have questions about using it in Semantic Deduplication?

ayushdg • Nov 06 '25 19:11

Thank you very much for your reply. The upcoming PR sounds great, and I really appreciate the update.
I'm currently unsure how to use it for Fuzzy/Exact deduplication. I was able to understand the usage for Semantic Deduplication, because the tutorial notebooks explain it clearly.

popiemon • Nov 07 '25 00:11

This may be related to issue #1216.

I'm not sure this is just a matter of documentation. I've spent two days trying to get the fuzzy dedup workflow to succeed, using the example straight from the existing docs and NVIDIA's own docker image. It fails immediately, for what appear to be fundamental reasons. It's almost as if the software is completely broken.

jmcmanus15 • Nov 09 '25 02:11

Started adding this in #1242. Will follow up with Exact deduplication in a different PR.

ayushdg • Dec 02 '25 18:12