NeMo-Curator Semantic Dedup doesn't work with UCX

Open praateekmahajan opened this issue 1 year ago • 1 comments

Describe the bug

Semantic Dedup often hangs at the point where we call semantic_cluster_dedup.extract_dedup_data.

Steps/Code to reproduce bug

Run Semantic Dedup with a client created as client = get_client(device_type='gpu', protocol='ucx').
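When reproducing a hang like this, it can help to wrap the suspect step in a timeout so that a stuck run surfaces as an error instead of blocking forever. The following is a minimal sketch using only the Python standard library. The helper run_with_timeout and the stand-in stuck_step are hypothetical names introduced for illustration; they are not part of NeMo-Curator, and the real extract_dedup_data call (whose arguments are not shown in this issue) would take the place of stuck_step.

```python
import threading
import time


def run_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread; raise TimeoutError if it does not finish in time."""
    outcome = {}

    def target():
        try:
            outcome["value"] = fn()
        except BaseException as exc:  # hand worker errors back to the caller
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The call is still running. The daemon thread is abandoned, not killed.
        raise TimeoutError(f"step did not finish within {timeout_s} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


def stuck_step():
    # Hypothetical stand-in for a call that hangs; it only sleeps.
    time.sleep(2)


try:
    run_with_timeout(stuck_step, timeout_s=0.2)
    hang_detected = False
except TimeoutError:
    hang_detected = True

print("hang detected:", hang_detected)
```

Note that the timeout only reports the hang. It does not stop the underlying work, so the Dask client and cluster would still need to be shut down separately.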

Environment overview

Tried on cudf-cu12==24.8.* and cudf-cu12==24.10.a*

The same run succeeds when protocol='tcp'.

Oct 08 '24 00:10 praateekmahajan

Also, from a quick experiment, the classifiers (domain / quality) seem to be about 30% slower when using UCX.
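For anyone who wants to put a number on a slowdown like this, here is a rough sketch of one way to measure it. It assumes that "30% slower" means about 30% more wall-clock time than a baseline run (for example the same job with protocol='tcp'); the issue does not state the baseline or how the figure was obtained. The helpers median_runtime and relative_slowdown are hypothetical names, not NeMo-Curator functions.

```python
import statistics
import time


def median_runtime(fn, repeats=3):
    """Return the median wall-clock time in seconds of calling fn() `repeats` times."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def relative_slowdown(baseline_s, candidate_s):
    """Fractional slowdown of candidate vs. baseline; 0.30 means 30% more wall time."""
    if baseline_s <= 0:
        raise ValueError("baseline_s must be positive")
    return (candidate_s - baseline_s) / baseline_s


# Illustrative numbers only, not measurements from this issue:
# a 10 s baseline run and a 13 s candidate run.
example = relative_slowdown(10.0, 13.0)
print(f"slowdown: {example:.0%}")
```

In practice each fn passed to median_runtime would run the same classifier workload once under each protocol, with a fresh client per configuration so that one run's worker state does not leak into the other.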

Oct 08 '24 00:10 praateekmahajan

Feb 04 '25 18:02 praateekmahajan

After a substantial refactoring of Semantic Dedup, this is no longer an issue.

May 14 '25 21:05 praateekmahajan