NeMo-Curator
Semantic Dedup doesn't work with UCX
Describe the bug
Semantic dedup often hangs when we call semantic_cluster_dedup.extract_dedup_data.
Steps/Code to reproduce bug
Run semantic dedup with client = get_client(device_type='gpu', protocol='ucx')
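When reproducing, it can help to tell a real hang apart from a slow run. Below is a minimal sketch using only the Python standard library; `finishes_within` is a hypothetical helper, not part of NeMo-Curator, and the callable passed to it would be whatever wraps the dedup step in your own script.

```python
import threading
import time


def finishes_within(fn, timeout_s):
    """Run fn() in a daemon thread; return True if it completes within timeout_s seconds.

    Hypothetical helper for illustration. If fn() raises, the event is never
    set, so this also returns False after the timeout.
    """
    done = threading.Event()

    def target():
        fn()
        done.set()

    # Daemon thread so a truly stuck call does not block interpreter exit.
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    return done.wait(timeout_s)


# Stand-in workloads; replace with a function that calls the dedup step.
quick_ok = finishes_within(lambda: None, 1.0)
slow_ok = finishes_within(lambda: time.sleep(2), 0.1)
```

Running the same wrapper once per protocol gives a simple pass/fail signal per run without waiting on a stuck job.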
Environment overview
Tried on cudf-cu12==24.8.* and cudf-cu12==24.10.a*
Succeeds when protocol='tcp'
Also, from a quick experiment, classifiers (domain / quality) appear to be about 30% slower when using UCX.
We should check whether PR #80, "Patch Distributed UCX comms to allow configuring connect timeout" (docs here), helps resolve this issue.
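For anyone trying this, here is a minimal sketch of raising the connect timeout through Dask's environment-variable configuration. It assumes the patched comms read Dask's `distributed.comm.timeouts.connect` setting, which this issue does not confirm. The `120s` value is an arbitrary example and `set_connect_timeout` is a hypothetical helper.

```python
import os

# Dask maps environment variables of the form DASK_A__B__C to the config key a.b.c.
# Assumption: the connect timeout in question is `distributed.comm.timeouts.connect`.
CONNECT_TIMEOUT_VAR = "DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT"


def set_connect_timeout(value="120s"):
    """Set the env var and return the value now in effect (hypothetical helper)."""
    os.environ[CONNECT_TIMEOUT_VAR] = value
    return os.environ[CONNECT_TIMEOUT_VAR]


# Call this before importing dask, or call dask.config.refresh() afterwards,
# so that the new value is picked up.
effective = set_connect_timeout("120s")
```

Setting the variable in the environment that launches the scheduler and workers is what matters; setting it only in the client process may not affect worker-to-worker connections.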
After a substantial refactoring of semantic dedup, this is no longer an issue.