NeMo-Curator
Hard negative mining for Retriever fine-tuning
Description
Adds functionality for creating training datasets for retriever customization by mining hard negatives.
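For readers new to the term, here is a minimal sketch of what selecting hard negatives can look like. The PR description does not state the mining criterion, so everything here is an assumption for illustration: the function names `cosine` and `mine_hard_negatives`, the `k` and `margin` parameters, and the rule itself (keep the non-positive documents most similar to the query, skipping any that score above `margin` times the positive's score because they are likely unlabeled positives). It is not the implementation in `mine_hard_negatives.py`.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query_vec, positive_id, doc_vecs, k=2, margin=0.95):
    """Return up to k hard-negative document ids for one query.

    Hypothetical rule: rank non-positive documents by similarity to the
    query, drop those scoring above margin * positive_score (probable
    false negatives), and keep the top k of the rest.
    """
    pos_score = cosine(query_vec, doc_vecs[positive_id])
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id != positive_id
    ]
    kept = [(score, doc_id) for score, doc_id in scored if score <= margin * pos_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in kept[:k]]


# Toy 2-d embeddings; "c" is a near-duplicate of the positive.
docs = {
    "pos": [1.0, 0.1],
    "a": [0.9, 0.5],
    "b": [0.0, 1.0],
    "c": [1.0, 0.05],
    "d": [0.5, 0.5],
}
hard_negatives = mine_hard_negatives([1.0, 0.0], "pos", docs)
print(hard_negatives)
```

In this toy data, "c" is filtered out by the margin, and the two closest remaining documents are returned.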
Usage
- Semantically cluster documents into partitions:
python3 repartition.py --input-dir=<input_directory> --hard-negative-mining-config=<your-config-file.yaml> --output-dir=<output-directory> --api-key=<your-api-key>
- Mine hard negatives separately for each of the partitions:
python3 mine_hard_negatives.py --input-dir=<output_directory_of_step1>/clustered_dataset/ --hard-negative-mining-config=<your-config-file.yaml> --output-dir=<output-directory> --api-key=<your-api-key>
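The two steps can be chained, since step 2 reads the `clustered_dataset/` folder under step 1's output directory. Below is a small sketch that only assembles the two command lines from the flags shown above. The helper name `build_commands`, the example paths, and the `API_KEY` environment variable are hypothetical and not part of this PR.

```python
import os
import shlex


def build_commands(input_dir, config, cluster_out, mining_out, api_key):
    """Return the two CLI invocations as argument lists, in run order."""
    # Step 1: semantically cluster the documents into partitions.
    repartition = [
        "python3", "repartition.py",
        f"--input-dir={input_dir}",
        f"--hard-negative-mining-config={config}",
        f"--output-dir={cluster_out}",
        f"--api-key={api_key}",
    ]
    # Step 2: mine hard negatives from the clustered_dataset/ folder
    # that step 1 wrote under its output directory.
    clustered = cluster_out.rstrip("/") + "/clustered_dataset/"
    mine = [
        "python3", "mine_hard_negatives.py",
        f"--input-dir={clustered}",
        f"--hard-negative-mining-config={config}",
        f"--output-dir={mining_out}",
        f"--api-key={api_key}",
    ]
    return [repartition, mine]


# Example values are placeholders; the key is read from a hypothetical
# API_KEY environment variable so it does not end up in shell history.
commands = build_commands(
    "data/in", "config.yaml", "out/step1", "out/step2",
    os.environ.get("API_KEY", "<your-api-key>"),
)
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the steps, each list can be passed to `subprocess.run(cmd, check=True)` so that step 2 starts only if step 1 succeeds.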
Checklist
- [x] I am familiar with the Contributing Guide.
- [x] New or Existing tests cover these changes.
- [x] The documentation is up to date with these changes.