data-juicer
[NewOp] Add group_diversity_filter op
As the title says, this op calculates the in-group diversity for a batch of samples.
Here's the breakdown:
- It first converts all input samples into embedding vectors.
- Then, it calculates the cosine similarity of each sample against the average embedding of the whole group.
- Finally, it normalizes these similarities to produce the stat "text_ebd_diversity_score" for each sample.
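A minimal sketch of the three steps, assuming the embeddings are already computed and that the normalization is min-max scaling (the description does not say which scheme the op uses, or whether the score is inverted so that outliers rank higher). The function name `normalized_group_similarity` is hypothetical and is not part of data-juicer:

```python
import math


def normalized_group_similarity(embeddings, eps=1e-12):
    """Cosine similarity of each embedding to the group mean, min-max scaled.

    Hypothetical helper for illustration only; the real op may normalize
    differently before writing "text_ebd_diversity_score".
    """
    n = len(embeddings)
    dim = len(embeddings[0])

    # Average embedding of the whole group.
    mean = [sum(vec[i] for vec in embeddings) / n for i in range(dim)]
    mean_norm = math.sqrt(sum(x * x for x in mean))

    # Cosine similarity of every sample against that average.
    sims = []
    for vec in embeddings:
        dot = sum(a * b for a, b in zip(vec, mean))
        norm = math.sqrt(sum(a * a for a in vec)) * mean_norm
        sims.append(dot / max(norm, eps))

    # Min-max normalization (assumed); a constant group maps to all zeros.
    lo, hi = min(sims), max(sims)
    if hi - lo < eps:
        return [0.0] * n
    return [(s - lo) / (hi - lo) for s in sims]


scores = normalized_group_similarity([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

In this toy group the two identical samples sit closest to the group average and the third sample sits furthest from it, so under the assumed scaling the third sample receives the lowest value.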
This op can support the diversity reward shaping in Trinity-RFT.
[Note] Since this op needs to see all samples to calculate a single group average, the num_proc (np) must be set to 1.
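To illustrate why the group must not be split across workers, here is a small sketch assuming a hypothetical even split of four samples into two shards. The helper `mean_vector` is made up for this example:

```python
def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


group = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# One process sees the whole group and computes a single average.
global_mean = mean_vector(group)

# Two processes would each see only half the group and compute their own average.
shard_means = [mean_vector(group[:2]), mean_vector(group[2:])]
```

Here each shard's average coincides with its own samples, so every sample would look maximally similar to "the group", whereas the single-process average lies between the two clusters.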
[TODO] This op may need to handle the input data more dynamically, especially when dealing with batches of prompt-rollouts from a Trinity Buffer.