[BUG] Semdedup Embedding Restart not working cleanly

Open VibhuJawa opened this issue 1 year ago • 1 comments

Describe the bug

Currently, our semdedup restart mechanism for the embedding step does not work cleanly.

This is because the input is read with add_filename=False:

https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L62-L64

And the output is written with write_to_filename=False:

https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L78

In addition, get_remaining_files by default can't compare files that have different extensions:

https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/utils/file_utils.py#L66-L80
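
To illustrate the extension mismatch, here is a minimal sketch. It is not the actual get_remaining_files implementation: the helper names, the file paths, and the .jsonl/.parquet extensions are all assumptions made up for this example. It also assumes each output file keeps its input file's base name, which would only be the case once filenames are tracked on read and write.

```python
import os


def naive_remaining(input_files, output_files):
    # Hypothetical helper: compares full base names, so "part_0.jsonl"
    # never matches "part_0.parquet" and every input looks unprocessed.
    done = {os.path.basename(p) for p in output_files}
    return [p for p in input_files if os.path.basename(p) not in done]


def remaining_by_stem(input_files, output_files):
    # Hypothetical alternative: compare base names with the extension stripped.
    def stem(path):
        return os.path.splitext(os.path.basename(path))[0]

    done = {stem(p) for p in output_files}
    return [p for p in input_files if stem(p) not in done]


# Assumed example layout: JSONL inputs, Parquet embedding outputs.
inputs = ["data/part_0.jsonl", "data/part_1.jsonl"]
outputs = ["embeddings/part_0.parquet"]

print(naive_remaining(inputs, outputs))    # ['data/part_0.jsonl', 'data/part_1.jsonl']
print(remaining_by_stem(inputs, outputs))  # ['data/part_1.jsonl']
```

With the naive comparison a restart would recompute everything; comparing by stem skips part_0, which already has an output.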

VibhuJawa, Aug 19 '24 17:08