NeMo-Curator
[BUG] Semdedup Embedding Restart not working cleanly
Describe the bug
Currently, our semdedup restart mechanism for embeddings does not work cleanly.
This is because of the following: add_filename=False is set here
https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L62-L64
And write_to_filename is False here
https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L78
And get_remaining_files by default can't compare files with different extensions (for example, JSONL inputs against Parquet outputs).
https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/utils/file_utils.py#L66-L80
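To illustrate the extension mismatch, here is a minimal sketch of an extension-agnostic comparison. It is not the actual get_remaining_files implementation: the helper name remaining_inputs and the example paths are hypothetical, and it assumes each output file keeps the base name of its input file, which would require the filename to be carried through on read and write.

```python
from pathlib import Path


def remaining_inputs(input_files, output_files):
    """Return the input files that have no output file with the same stem.

    Hypothetical helper: files are matched on their base name without the
    extension, so "a.jsonl" is treated as already processed if "a.parquet"
    exists in the output list.
    """
    done = {Path(f).stem for f in output_files}
    return [f for f in input_files if Path(f).stem not in done]


# Hypothetical paths, for illustration only.
inputs = ["my_data/a.jsonl", "my_data/b.jsonl", "my_data/c.jsonl"]
outputs = ["cache/embeddings/a.parquet", "cache/embeddings/b.parquet"]

print(remaining_inputs(inputs, outputs))  # ['my_data/c.jsonl']
```

With a comparison like this, a restart would only pick up c.jsonl instead of recomputing embeddings for every input file.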
Happy to pair on this at some point; in general there are a couple of things I have been thinking should be refactored with DocumentDataset's read and write functions.
See: https://github.com/NVIDIA/NeMo-Curator/issues/50, https://github.com/NVIDIA/NeMo-Curator/issues/180, https://github.com/NVIDIA/NeMo-Curator/issues/293...
@sarahyurick, given your PRs, I think you should probably just take this on. Happy to provide input as needed. Let me know what you think.
I'm not sure I can reproduce this. I ran:
python compute_embeddings.py \
--input-data-dir "my_data" \
--input-file-type "jsonl" \
--input-file-extension "jsonl" \
--config-file "semdedup_config.yaml"
where my_data is a directory with 2 JSONL files. In semdedup_config.yaml, I specified a different directory as the cache_dir where the 2 resulting Parquet files were written. When I rerun without changing anything, there are no errors.
LMK if there is anything else I should be setting or changing, otherwise we can close this issue.
NVM, the issue is that it should not rerun if the embeddings are already present.
Closed by https://github.com/NVIDIA/NeMo-Curator/pull/327.