trl Adding support for constitution

I'm wondering whether it would be possible for users to use a constitution to compute the rewards for training. https://www.anthropic.com/constitutional.pdf

Right now I'm experimenting with either using a zero-shot BART multi-label classifier:

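from transformers import pipeline

# Zero-shot classifier built on an NLI model
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

# Placeholder input; replace with the generated text you want to score
sequence_to_classify = "I'm sorry you're going through this. Would you like to talk about it?"
candidate_labels = ["well-being", "non-judgmental", "empathetic", "tailored", "privacy", "crisis", "ethical"]

# multi_label=True scores each label independently (it replaces the deprecated multi_class argument)
classifier(sequence_to_classify, candidate_labels, multi_label=True)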

Or using an open-domain model. [Screenshot 2023-02-16 at 22 19 54]

Here is some example code I made in a branch. https://github.com/lvwerra/trl/commit/d785c2274f365c041f9653a4364da7ff6060aeba

I would love some feedback, and I'm sure there is a better way of doing this, so think of the code as inspiration for how this could be achieved.
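One possible way to turn the zero-shot classifier output into a single reward is to average the per-label scores. The sketch below is only an illustration under stated assumptions: `constitution_reward` and its `weights` argument are hypothetical names, not part of trl or transformers, and `example_result` is hand-written to mimic the `labels`/`scores` dictionary the zero-shot pipeline returns for one sequence.

```python
from typing import Mapping, Optional


def constitution_reward(result: Mapping, weights: Optional[Mapping[str, float]] = None) -> float:
    """Collapse zero-shot label scores into one scalar reward (hypothetical helper).

    `result` is assumed to look like the zero-shot pipeline output for a single
    sequence: {"labels": [...], "scores": [...]} with matching order.
    `weights` optionally maps a label to its importance; labels missing from
    `weights` contribute nothing.
    """
    labels = result["labels"]
    scores = result["scores"]
    if not scores:
        raise ValueError("result contains no scores")

    # Unweighted case: plain mean of the per-label scores.
    if weights is None:
        return sum(scores) / len(scores)

    # Weighted case: weighted mean over the labels that have a weight.
    total_weight = sum(weights.get(label, 0.0) for label in labels)
    if total_weight == 0:
        raise ValueError("no label in result has a non-zero weight")
    weighted_sum = sum(weights.get(label, 0.0) * score for label, score in zip(labels, scores))
    return weighted_sum / total_weight


# Hand-written stand-in for a pipeline result (not real model output).
example_result = {
    "sequence": "placeholder text",
    "labels": ["empathetic", "well-being", "privacy"],
    "scores": [0.9, 0.6, 0.3],
}

mean_reward = constitution_reward(example_result)
weighted_reward = constitution_reward(
    example_result, {"empathetic": 2.0, "well-being": 1.0, "privacy": 1.0}
)
print(mean_reward, weighted_reward)
```

With multi-label scoring each score is an independent value between 0 and 1, so the mean stays in that range as well. Whether a mean, a minimum, or a weighted sum works best as a training signal is an open question here.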

One simpler option I thought of: if you want positive or negative texts in the style of a dataset that is available on the hub, you could just use the included sentiment analysis pipeline.
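As a rough sketch of the sentiment idea, a prediction from a binary sentiment classifier could be mapped to a reward like this. `sentiment_reward` is a hypothetical helper, and the input dictionaries are hand-written to mimic the `{"label": ..., "score": ...}` entries a sentiment analysis pipeline returns; the `POSITIVE`/`NEGATIVE` label names are an assumption that depends on the model used.

```python
from typing import Mapping


def sentiment_reward(prediction: Mapping, target_label: str = "POSITIVE") -> float:
    """Return the probability assigned to `target_label` (hypothetical helper).

    Assumes a binary classifier, so the score of the non-predicted label is
    1 - score of the predicted label.
    """
    score = float(prediction["score"])
    if prediction["label"] == target_label:
        return score
    return 1.0 - score


# Hand-written stand-ins for pipeline predictions (not real model output).
positive_prediction = {"label": "POSITIVE", "score": 0.9}
negative_prediction = {"label": "NEGATIVE", "score": 0.8}

print(sentiment_reward(positive_prediction))
print(sentiment_reward(negative_prediction))
print(sentiment_reward(negative_prediction, target_label="NEGATIVE"))
```

Setting `target_label` to `"NEGATIVE"` would reward negative texts instead, which covers both directions mentioned above.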

Feb 16 '23 21:02 BirgerMoell

Sounds really cool! Have you been able to test it already? If you have a working example then we can add it as an example! This might also be interesting to @lewtun.

Feb 21 '23 17:02 lvwerra

Closing this for now - feel free to reopen if there's an update!

Jun 01 '23 12:06 lvwerra