NeMo-Curator
[BUG] Semdedup Embedding Restart not working cleanly
Describe the bug
Currently, the semdedup restart mechanism for the embedding step does not work cleanly.
This is because the input is read with add_filename=False:
https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L62-L64
The output is also written with write_to_filename=False:
https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L78
Finally, get_remaining_files cannot, by default, compare files that have different extensions:
https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/utils/file_utils.py#L66-L80
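
To illustrate the extension problem, here is a minimal sketch of an extension-agnostic comparison. The helper `remaining_files_ignoring_extension` and the example paths are hypothetical and are not part of NeMo-Curator. The sketch also assumes each output file keeps the base name of its input file, which would only hold if filenames are tracked on read and used on write.

```python
import os


def remaining_files_ignoring_extension(input_files, completed_files):
    """Hypothetical helper: return the input files that have no completed
    counterpart, matching on base name with the extension stripped."""
    done = {os.path.splitext(os.path.basename(p))[0] for p in completed_files}
    return sorted(
        p
        for p in input_files
        if os.path.splitext(os.path.basename(p))[0] not in done
    )


# Illustrative paths only: inputs and outputs use different extensions.
inputs = ["data/part_0.jsonl", "data/part_1.jsonl", "data/part_2.jsonl"]
outputs = ["embeddings/part_0.parquet", "embeddings/part_1.parquet"]

# Comparing full file names never matches, so every input looks unprocessed.
output_names = {os.path.basename(p) for p in outputs}
naive = [p for p in inputs if os.path.basename(p) not in output_names]

# Comparing stems correctly skips the two files that are already done.
remaining = remaining_files_ignoring_extension(inputs, outputs)
print(naive)
print(remaining)
```

With a name-based comparison, all three inputs are reported as remaining, whereas the stem-based comparison reports only the file without a matching output.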