FlagEmbedding: hard negative mining (HNM) via hn_mine.py returns gibberish hard negatives

Hi, I'm trying to do hard negative mining (HNM) with hn_mine.py. My dataset looks like this:

# sample.jsonl (120k rows)
{
    "query": "사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의가 있어야 하나, 법원 ...(omitted)
    "pos": "아닙니다. 사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의 ...(omitted)
}

python hn_mine.py \
--input_file sample.jsonl \
--output_file sample_output.jsonl \
--range_for_sampling 2-30 \
--negative_number 5 \
--use_gpu_for_searching \
--embedder_name_or_path .../models/bge-m3 \
--embedder_model_class encoder-only-m3

Notes: the model was downloaded from Hugging Face with git clone (BAAI/bge-m3). I tried both with --embedder_model_class encoder-only-m3 and without it.

However, the mined hard negatives come out like this:

{
    "query": "사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의가 있어야 하나, 법원 ...(omitted)
    "pos": "아닙니다. 사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의 ...(omitted)
    "neg": [
        "初",
        "샹",
        "듭",
        "試",
        "ち"
    ]
}

My dataset consists entirely of natural-language text. How can I solve this problem?

Feb 28 '25 04:02 seongjiko

The pos and neg fields in the dataset should each be stored as a list of strings, so that the retrieved hard negatives are complete sentences. If a plain string is passed in instead, it gets split into individual characters.
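
To illustrate why a plain string ends up as single characters, here is a small sketch of the general Python behaviour. It assumes the candidate pool is built by extending a list with each record's pos value; I have not quoted the actual hn_mine.py code here, and the variable names are made up for the example.

```python
# Sketch only: the names below are hypothetical, not taken from hn_mine.py.
pos_as_str = "hard negative mining"       # "pos" stored as a bare string
pos_as_list = ["hard negative mining"]    # "pos" stored as a list of strings

corpus_from_str = []
corpus_from_list = []

# list.extend() iterates over its argument; iterating a str yields characters.
corpus_from_str.extend(pos_as_str)
corpus_from_list.extend(pos_as_list)

print(corpus_from_str[:4])   # ['h', 'a', 'r', 'd']
print(corpus_from_list)      # ['hard negative mining']
```

With the string form, every character becomes its own candidate passage, which would explain negatives such as "初" or "샹".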

Mar 12 '25 02:03 545999961

Thanks for your answer! Specifically, do you mean I should change the example I posted to the following?

{
    "query": "사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의가 있어야 하나, 법원 ...(omitted)
    "pos": ["아닙니다. 사채권자가 자본금 감소에 대하여 이의를 제기하려면 사채권자집회의 결의 ...(omitted)"]
}

My dataset doesn't have any 'neg' data.
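
If that is the intended format, something along these lines could convert the file. This is only a sketch: it assumes each record sits on a single line of the .jsonl file, and the function names and the output file name sample_list.jsonl are my own placeholders. Records without a "neg" key are left without one.

```python
import json


def wrap_in_list(value):
    """Return value as a list; a bare string becomes a one-item list."""
    if isinstance(value, str):
        return [value]
    return list(value)


def convert_line(raw_line):
    """Convert one JSONL record so that 'pos' and 'neg' (if present) are lists."""
    record = json.loads(raw_line)
    for key in ("pos", "neg"):
        if key in record:
            record[key] = wrap_in_list(record[key])
    # ensure_ascii=False keeps Korean text readable in the output file.
    return json.dumps(record, ensure_ascii=False)


def convert_file(src_path, dst_path):
    """Rewrite a JSONL file line by line, skipping blank lines."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for raw_line in src:
            if raw_line.strip():
                dst.write(convert_line(raw_line) + "\n")


# Example call (hypothetical output name):
# convert_file("sample.jsonl", "sample_list.jsonl")
```

The converted file would then be passed to --input_file in place of sample.jsonl.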

Mar 20 '25 11:03 seongjiko