
Hard negative mining for Retriever fine-tuning

Open vinay-raman opened this issue 9 months ago • 0 comments

Description

Adds functionality for creating training datasets for retriever customization (fine-tuning) by mining hard negatives.

Usage

  1. Semantically cluster the documents into partitions:
     python3 repartition.py --input-dir=<input-directory> --hard-negative-mining-config=<your-config-file.yaml> --output-dir=<output-directory> --api-key=<your-api-key>
  2. Mine hard negatives separately for each partition, using the clustered output of step 1 as input:
     python3 mine_hard_negatives.py --input-dir=<output-directory-of-step-1>/clustered_dataset/ --hard-negative-mining-config=<your-config-file.yaml> --output-dir=<output-directory> --api-key=<your-api-key>
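
For readers unfamiliar with the technique, the idea behind step 2 can be sketched in a few lines of plain Python. This is only an illustration, not the code in `mine_hard_negatives.py`: the function names, the toy two-dimensional embeddings, and the plain top-k selection rule are all assumptions made for the example. Within one partition, the documents most similar to a query that are not its known positive are treated as hard negatives.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mine_hard_negatives(query_vec, positive_id, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query,
    excluding the known positive (hypothetical helper, for illustration)."""
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id != positive_id
    ]
    # Highest similarity first: these are the "hardest" negatives.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings standing in for one semantically clustered partition.
partition = {
    "doc_a": [1.0, 0.0],  # the known positive for the query
    "doc_b": [0.9, 0.1],
    "doc_c": [0.7, 0.7],
    "doc_d": [0.0, 1.0],
}
query = [1.0, 0.05]

hard_negatives = mine_hard_negatives(query, "doc_a", partition, k=2)
print(hard_negatives)  # ['doc_b', 'doc_c']
```

Restricting the search to one partition at a time keeps the candidate set small and topically close to the query, which is presumably why the clustering step comes first.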

Checklist

  • [x] I am familiar with the Contributing Guide.
  • [x] New or Existing tests cover these changes.
  • [x] The documentation is up to date with these changes.

vinay-raman, Feb 05 '25 19:02