data-juicer [NewOp] Add group_diversity

[NewOp] Add group_diversity_filter op

Open lingzhq opened this issue 4 months ago • 0 comments

As the title says, this op calculates the in-group diversity for a batch of samples.

Here's the breakdown:

It first converts all input samples into embedding vectors.
Then, it calculates the cosine similarity of each sample against the average embedding of the whole group.
Finally, it normalizes these similarities to produce the stat "text_ebd_diversity_score" for each sample.
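
For reference, the three steps above can be sketched roughly like this. This is only an illustration, not the op's actual code: the function name `group_diversity_scores` is hypothetical, the embeddings are assumed to be precomputed by some embedding model, and min-max scaling is an assumed choice since the exact normalization (and whether the score should be inverted so that outliers score higher) is not spelled out here.

```python
import numpy as np


def group_diversity_scores(embeddings, eps=1e-12):
    """Hypothetical sketch: similarity of each sample to the group mean.

    `embeddings` is an (n_samples, dim) array that is assumed to come
    from an embedding model (step 1, not shown here).
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)

    # Step 2: cosine similarity of every sample against the group average.
    mean_vec = embeddings.mean(axis=0)
    denom = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(mean_vec)
    sims = embeddings @ mean_vec / np.maximum(denom, eps)

    # Step 3: normalize. Min-max scaling is an ASSUMPTION of this sketch.
    span = sims.max() - sims.min()
    if span < eps:
        # All samples are equally similar to the mean.
        return np.zeros_like(sims)
    return (sims - sims.min()) / span


# Toy group: two orthogonal samples and one lying along the group mean.
toy_group = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_scores = group_diversity_scores(toy_group)
```

In this toy group the third sample points in the same direction as the average embedding, so it gets the highest normalized similarity, while the two orthogonal samples tie for the lowest.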

This op can support the diversity reward shaping in Trinity-RFT.

[Note] Because this op needs to see every sample in the group to compute a single group average, num_proc (np) must be set to 1.
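
A small sketch of why the single-process constraint matters: if the group were split across workers, each worker would compute its own average and the similarities would change. The helper `sims_to_group_mean` and the toy data are hypothetical, and the two-way split only imitates what sharding across processes might look like.

```python
import numpy as np


def sims_to_group_mean(embeddings):
    """Cosine similarity of each row against the mean of the rows given."""
    mean_vec = embeddings.mean(axis=0)
    denom = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(mean_vec)
    return embeddings @ mean_vec / denom


# Toy group of four samples forming two tight clusters.
group = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])

# One process sees the whole group and uses a single group average.
whole = sims_to_group_mean(group)

# Two imitated workers each see half the group and use their own average.
sharded = np.concatenate(
    [sims_to_group_mean(group[:2]), sims_to_group_mean(group[2:])]
)

print(whole)
print(sharded)
```

With the split, every sample looks very close to its local average, so the per-sample values no longer reflect diversity within the full group.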

[TODO] This op may need to handle the input data more dynamically, especially when dealing with batches of prompt-rollouts from a Trinity Buffer.

Jul 22 '25 07:07 lingzhq
