
[NewOp] Add group_diversity_filter op

Open lingzhq opened this issue 4 months ago • 0 comments

As the title says, this op calculates the in-group diversity for a batch of samples.

Here's the breakdown:

  • It first converts all input samples into embedding vectors.
  • Then, it calculates the cosine similarity of each sample against the average embedding of the whole group.
  • Finally, it normalizes these similarities to produce the "text_ebd_diversity_score" stat for each sample.
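
The three steps above can be sketched roughly as follows. This is only an illustration, not the op's actual code: it assumes the embeddings have already been computed (the embedding model is not named in this issue), and it assumes min-max normalization, since the issue does not say which normalization is used. The function name `group_similarity_scores` is hypothetical.

```python
import numpy as np


def group_similarity_scores(embeddings, eps=1e-12):
    """Cosine similarity of each row to the group mean, min-max normalized.

    Assumption: min-max scaling to [0, 1]; the real op may normalize differently.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)

    # Average embedding of the whole group.
    centroid = embeddings.mean(axis=0)

    # Cosine similarity of every sample against that average.
    dots = embeddings @ centroid
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    sims = dots / np.maximum(norms, eps)

    # Normalize within the group; a group with identical similarities gets zeros.
    span = sims.max() - sims.min()
    if span < eps:
        return np.zeros_like(sims)
    return (sims - sims.min()) / span


# Toy group: two identical samples and one outlier.
scores = group_similarity_scores([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy group the two identical samples sit closest to the average and the outlier sits farthest from it. Whether the op reports the normalized similarity directly or inverts it to get a diversity value is not stated here, so the sketch stops at the normalized similarity.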

This op can support the diversity reward shaping in Trinity-RFT.

[Note] Since this op needs to see all samples to calculate a single group average, the num_proc (np) must be set to 1.
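
To illustrate why this matters, here is a small sketch assuming that with more than one process each worker would only see its own shard of the group. The helper `cosine_to_mean` and the toy data are made up for the example; the point is only that per-shard averages give different similarities than the whole-group average.

```python
import numpy as np


def cosine_to_mean(embeddings):
    # Cosine similarity of each row to the mean of the rows it is given.
    centroid = embeddings.mean(axis=0)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    return embeddings @ centroid / norms


group = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

# One process: every sample is compared against the true group average.
whole = cosine_to_mean(group)

# Two hypothetical shards: each half is compared against its own average.
sharded = np.concatenate([cosine_to_mean(group[:2]), cosine_to_mean(group[2:])])
```

Here `whole` is about 0.707 for every sample, while `sharded` is 1.0 everywhere, because each shard happens to contain only identical samples.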

[TODO] This op may need to handle the input data more dynamically, especially when dealing with batches of prompt-rollouts from a Trinity Buffer.

lingzhq · Jul 22 '25 07:07