
[BUG] Semdedup Embedding Restart not working cleanly

Open VibhuJawa opened this issue 1 year ago • 1 comments

Describe the bug

Currently our semdedup restart mechanism for embedding is not working cleanly.

This is because of the following (add_filename=False):

https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L62-L64

And write_to_filename is False:

https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/scripts/semdedup/compute_embeddings.py#L78

And get_remaining_files by default can't compare files with different extensions:

https://github.com/NVIDIA/NeMo-Curator/blob/3a31ab13137e43fd8c1ffd40e07c52606a852acb/nemo_curator/utils/file_utils.py#L66-L80
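
To illustrate the extension mismatch, here is a minimal sketch of a restart check that compares files by stem rather than by full name. The helper `remaining_inputs` and the example paths are hypothetical and are not part of NeMo-Curator. The sketch also assumes each output file is named after its input file, which is exactly what the add_filename / write_to_filename settings above prevent.

```python
from pathlib import Path


def remaining_inputs(input_files, output_files):
    """Return the input files that have no output file with the same stem.

    Hypothetical helper for illustration only; it is not the NeMo-Curator
    implementation of get_remaining_files.
    """
    # Compare on the stem so "a.jsonl" matches "a.parquet".
    done = {Path(p).stem for p in output_files}
    return sorted(p for p in input_files if Path(p).stem not in done)


# Example paths are made up for the sketch.
inputs = ["my_data/a.jsonl", "my_data/b.jsonl"]
outputs = ["cache/a.parquet"]
print(remaining_inputs(inputs, outputs))  # ['my_data/b.jsonl']
```

With a check like this, a rerun would skip a.jsonl because its embeddings already exist, but only if the writer preserves the input file's name.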

VibhuJawa avatar Aug 19 '24 17:08 VibhuJawa

Happy to pair on this at some point; in general there are a couple of things I have been thinking should be refactored with DocumentDataset's read and write functions.

See: https://github.com/NVIDIA/NeMo-Curator/issues/50, https://github.com/NVIDIA/NeMo-Curator/issues/180, https://github.com/NVIDIA/NeMo-Curator/issues/293...

sarahyurick avatar Oct 14 '24 21:10 sarahyurick

@sarahyurick, I think, given your PRs, you should probably just take this on. Happy to provide input as needed. Let me know what you think.

VibhuJawa avatar Oct 22 '24 23:10 VibhuJawa

I'm not sure I can reproduce this. I ran:

python compute_embeddings.py \
    --input-data-dir "my_data" \
    --input-file-type "jsonl" \
    --input-file-extension "jsonl" \
    --config-file "semdedup_config.yaml"

where my_data is a directory with 2 JSONL files. In semdedup_config.yaml, I specified a different directory as the cache_dir where the 2 resulting Parquet files were written. When I rerun without changing anything, there are no errors.

LMK if there is anything else I should be setting or changing, otherwise we can close this issue.

sarahyurick avatar Oct 25 '24 20:10 sarahyurick

NVM, the issue is that it should not rerun if the embeddings are already present.

sarahyurick avatar Oct 25 '24 21:10 sarahyurick

Closed by https://github.com/NVIDIA/NeMo-Curator/pull/327.

sarahyurick avatar Nov 06 '24 21:11 sarahyurick