Threshold for unlabeled data
There is a paragraph like this on page 4 of your technical report:
The text pairs selected from the web and other public sources are not guaranteed to be closely related. Therefore, data quality can be a major concern. In our work, we use a simple strategy to filter the data before adding it to C-MTP (unlabeled). Particularly, we use a third-party model, Text2Vec-Chinese, to score the strength of relation for each text pair. We empirically choose a threshold of 0.43 and drop the samples whose scores are below the threshold. With such an operation, there are 100 million text pairs filtered from the unlabeled corpora. Despite its simplicity, we find that it effectively removes irrelevant text pairs when we manually review samples, and it leads to strong empirical performance for the models trained on C-MTP (unlabeled).
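For context, the filtering step described there amounts to something like the sketch below: score each pair with a similarity model and drop pairs below 0.43. The scorer here is a stand-in (a sentence-transformers checkpoint; the exact Text2Vec-Chinese model used in the report may differ), so treat this as illustrative only.

```python
from sentence_transformers import SentenceTransformer, util

# Stand-in scorer; the report uses a third-party Text2Vec-Chinese model.
# This checkpoint name is an assumption, not necessarily the one they used.
model = SentenceTransformer("shibing624/text2vec-base-chinese")

THRESHOLD = 0.43  # value reported in the paper

def filter_pairs(pairs):
    """Keep only text pairs whose relation score clears the threshold."""
    emb_a = model.encode([a for a, _ in pairs], convert_to_tensor=True)
    emb_b = model.encode([b for _, b in pairs], convert_to_tensor=True)
    # Cosine similarity between each aligned pair (a_i, b_i).
    scores = util.cos_sim(emb_a, emb_b).diagonal()
    return [pair for pair, s in zip(pairs, scores) if s >= THRESHOLD]

kept = filter_pairs([("标题文本", "正文片段"), ("unrelated", "pair")])
```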
I have a small question about how you chose the number 0.43. Is there a qualitative method for this selection?
We only analyzed some cases and selected a threshold, so it may not be the best value.
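(For what it's worth, that kind of manual inspection can be done with a small helper like the one below: sample a few scored pairs near each candidate threshold and eyeball them. Purely illustrative, not the authors' actual procedure.)

```python
import random

def inspect_candidates(scored_pairs, candidates=(0.3, 0.4, 0.43, 0.5), k=5, margin=0.02):
    """Print a few pairs scoring near each candidate threshold for manual review.

    scored_pairs: list of ((text_a, text_b), score) tuples.
    """
    for t in candidates:
        near = [(s, a, b) for (a, b), s in scored_pairs if abs(s - t) <= margin]
        print(f"--- threshold {t}: {len(near)} pairs within +/-{margin} ---")
        for s, a, b in random.sample(near, min(k, len(near))):
            print(f"[{s:.2f}] {a} || {b}")
```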
Hmm, I'm a bit confused about how many epochs we need to fine-tune on the unlabeled data. Is 3 a good number, or do we need more?
@staoxiao Hmm, I have a new question. Do you perform any dedup operations on the wiki and bookcorpus data? And if there is dedup, I wonder whether we should do it after splitting the data into training samples, or dedup the default samples in the raw dataset (before building the dataset).
No, I didn't perform any dedup operation.
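(If anyone wants to add dedup on top of the released data, a minimal exact-match sketch over the already-split training pairs is below; hashing lightly normalized text keeps memory bounded. This is hypothetical and not part of the released pipeline.)

```python
import hashlib

def dedup_pairs(pairs):
    """Drop exact duplicates of (text_a, text_b) after splitting into training samples."""
    seen = set()
    unique = []
    for a, b in pairs:
        # Light normalization so trivial whitespace differences don't defeat dedup.
        key = hashlib.md5((a.strip() + "\t" + b.strip()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique
```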