Open-Assistant
Making sure credence is properly calibrated and conveyed using hedge words
One of the complaints about ChatGPT is its overconfidence. ChatGPT is probably better than previous assistants in this regard, but I have an idea for how Open Assistant might be able to do better!
- We train a small model to predict the credence a claim conveys, i.e. "how likely does the speaker think this is true, as a probability?". A potential starting point is the research described here.
- The assistant's neural network will generate a credence for each claim. We train two aspects (at the same time as the RLHF step):
  - The wording of the claim is such that its credence, as judged by the model in (1), is close to the credence generated by the assistant.
  - The credence itself is calibrated. This means, for example, that 80% of the assistant's 80%-credence claims will be correct. The scoring rule is just log(p) if the claim is true and log(1-p) if the claim is false (i.e. it is just cross entropy). (We ask the human if the claim is true or not during the human feedback phase.) This scoring rule incentivizes both better knowledge and more accurate credence.
The important bit about credence calibration is that it gives a very large punishment if a high credence claim is incorrect. So even though humans typically prefer confident claims, the assistant still learns to hedge its bets to avoid the possibility of a large credence penalty. (The reward for correct claims is slightly higher for high credences though, so it's still optimal to give high credence to obvious claims.)
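To make that penalty structure concrete, here is a minimal Python sketch of the log scoring rule described above (the function name and the clamping epsilon are just illustrative choices, not a settled design):

```python
import math

def credence_score(credence: float, claim_is_true: bool) -> float:
    """Log scoring rule for a single claim.

    Returns log(p) if the claim is true and log(1 - p) if it is false,
    i.e. the (negated) cross-entropy contribution of that claim.
    Higher is better; an incorrect high-credence claim is punished heavily.
    """
    p = min(max(credence, 1e-9), 1 - 1e-9)  # clamp to avoid log(0)
    return math.log(p) if claim_is_true else math.log(1 - p)

# A correct 99% claim scores slightly better than a correct 80% claim...
print(credence_score(0.99, True))   # ~ -0.01
print(credence_score(0.80, True))   # ~ -0.22
# ...but an incorrect 99% claim is punished far more than an incorrect 80% one.
print(credence_score(0.99, False))  # ~ -4.61
print(credence_score(0.80, False))  # ~ -1.61
```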
(A question is whether we treat the entire response as a single claim, or split it up using NLP (perhaps part of the model in (1)).)
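On the splitting question: one crude baseline would be to treat each sentence as a separate claim, e.g. with spaCy's sentence segmentation. The pipeline name and helper below are only illustrative; a learned claim splitter (perhaps the model in (1)) could replace it:

```python
import spacy

# Sentence segmentation as a crude "one claim per sentence" splitter.
# en_core_web_sm is just one small English pipeline choice.
nlp = spacy.load("en_core_web_sm")

def split_into_claims(response: str) -> list[str]:
    """Split an assistant response into candidate claims (here: sentences)."""
    doc = nlp(response)
    return [sent.text.strip() for sent in doc.sents]

claims = split_into_claims(
    "The Eiffel Tower is in Paris. It was completed in 1889."
)
print(claims)  # ['The Eiffel Tower is in Paris.', 'It was completed in 1889.']
```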
The big risk of this is creating extremely compelling dis/misinformation. Why would this not converge rapidly on "telling people what they want to hear" as opposed to "what is factual"? If you can implement it and show some promising results, that would be great, but do you have any theoretical basis for why you think the users will be good at determining factuality without resorting to looking things up on their own?
@Rallio67
> why you think the users will be good at determining factuality without resorting to looking things up on their own?
To clarify, this isn't being done by end users; it's being done during training. The contributors would be looking things up and doing research. If they can't figure it out, I guess you exclude it from the training set? And anything that is overly subjective would probably be excluded (since the assistant shouldn't be making many subjective claims anyway).
Accuracy is one of the overarching goals of Open Assistant anyway, and it needs to be solved somehow. This idea just tacks credence onto that accuracy system, so the assistant conveys uncertainty in a natural way.
If the accuracy is achieved with something other than human feedback, credence would be trained by connecting to that system instead.