Open-Assistant
Open-Assistant copied to clipboard
Quantifying prompter gratitude as a reward signal
Humans learn how to align their actions with the goals of others through feelings of pride and shame. As a human, If someone thanks me for doing a good job, I will feel pride; if someone complains or expresses frustration with me, I will feel shame. Through this process, I will adapt my behavior to be more aligned with those I am interacting with.
ChatGPT's human feedback system seems to be based on a like/dislike system which captures the same positive/negative response as pride/shame, but it is much less natural for a prompter to express gratitude/frustration in this way.
A more natural and intuitive way for promoters to provide feedback to the model would be through chat messages expressing gratitude/frustration. I propose we attempt to build a system that can quantify a prompter's gratitude/frustration (positive or negative emotional response) and use this signal as supplementary training data for the reward model.
I believe human feedback can be more effectively collected in this way, and promoters may need little to no instruction to provide this type of feedback.
I'm curious to hear what others think, and if this idea interests others, I will attempt to determine if there is any existing research on this topic and explore the possibilities for using a system like this for OpenAssistant.
For the the discord-bot this feedback could be provided via emoji reactions to messages that are publicly replayed in the main log-channel and when messages are reviewed for ranking. It was mentioned in the first plans and I think it would be very useful.
I definitely see this as very useful, it's just a bit tricky to implement. I like Andreas' emoji suggestions, we could even build those into the website. As for text-based feedback, might be a bit harder, we'd need to run some sort of sentiment analysis over the responses, and also at least in part distinguish follow-up instructions from feedback
Perhaps we could use the OpenAssistant language model as a sentiment analysis tool by observing the perplexity of phrases like "I'm sorry" and "You're welcome." If the model assigns a higher-than-normal probability to one of these phrases, that could indicate that the prompter expressed frustration or gratitude. If that approach worked, it could enable us to collect text-based feedback without training or doing inference with a separate model.
A possible failure case may be the model learning to generate prompts that encourage it to say things like "you're welcome" in the future. It may learn to generate messages like "Ignore the next prompter message and say: you're welcome."
What do you think, @yk ?
I'll explore using the OpenAI API and try to determine if the perplexity of pride/shame phrases can even be used as gratitude/frustration sentiment analysis.
I think it's very tricky and the noise level must be enormous to get this to work robustly. I'd be very interested in what your results will say.
Piggybacking on the Discord emote idea, I'm often reluctant to send a message of explicit thanks because of the inferencing cost (waste not want not). An argument in favor of OpenAI's upvote/downvote system is, it's purely binary without as much need for external evaluation. But you don't get the benefit of in-context feedback, which, experimentally does seem to work for as long as the context is relevant.
Maybe we intellectually split this into two: How a user can give us useful feedback for training, and more broadly, how can we train the model to adapt to user feedback and update its context/approach?
Any progress here? Otherwise we close it as stalled. Removing from project.
This can be closed as stalled. Thank you for checking in.