NeMo-Curator

Semantic Dedup doesn't work with UCX

Open praateekmahajan opened this issue 1 year ago • 1 comments

Describe the bug

Semantic Dedup often hangs when we call semantic_cluster_dedup.extract_dedup_data.

Steps/Code to reproduce bug

Run semantic dedup with a client created as client = get_client(device_type='gpu', protocol='ucx').

Environment overview

Tried on cudf-cu12==24.8.* and cudf-cu12==24.10.a*

Succeeds when protocol='tcp'
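
A minimal sketch of how the tcp/ucx comparison above could be scripted so that a hang is reported instead of blocking forever. The helper names `client_kwargs`, `make_client`, and `run_with_timeout` are hypothetical and not part of NeMo Curator; the `get_client` arguments are copied as written in this report and may differ between versions. The import path for `get_client` is assumed to be `nemo_curator.utils.distributed_utils`.

```python
import threading


def client_kwargs(protocol):
    """Build the get_client arguments used in this report (hypothetical helper)."""
    if protocol not in ("tcp", "ucx"):
        raise ValueError(f"unsupported protocol: {protocol!r}")
    # Argument names are taken verbatim from the issue text.
    return {"device_type": "gpu", "protocol": protocol}


def make_client(protocol):
    """Create the Dask client; needs NeMo Curator and a GPU, so import lazily."""
    # Assumed import path; adjust if your installed version differs.
    from nemo_curator.utils.distributed_utils import get_client

    return get_client(**client_kwargs(protocol))


def run_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread and report 'ok', 'error', or 'timeout'.

    A timed-out call is not cancelled; the thread is only abandoned, which is
    enough to flag a hang in a short-lived repro script.
    """
    result = {}

    def target():
        try:
            result["value"] = fn()
        except Exception as exc:  # surface the failure to the caller
            result["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        return "timeout", None
    if "error" in result:
        return "error", result["error"]
    return "ok", result["value"]
```

To compare the two protocols, one would create the client with `make_client("tcp")` or `make_client("ucx")` and then wrap the `semantic_cluster_dedup.extract_dedup_data` call in `run_with_timeout` with a generous timeout; a `"timeout"` status under ucx but not tcp would match the behavior described here.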

praateekmahajan avatar Oct 08 '24 00:10 praateekmahajan

Also, a quick experiment suggests that the classifiers (domain / quality) are about 30% slower when using UCX.

praateekmahajan avatar Oct 08 '24 00:10 praateekmahajan

We should check whether PR #80, "Patch Distributed UCX comms to allow configuring connect timeout" (docs here), helps solve this issue.

praateekmahajan avatar Feb 04 '25 18:02 praateekmahajan

After a substantial refactoring of Semantic Dedup, this is no longer an issue.

praateekmahajan avatar May 14 '25 21:05 praateekmahajan