rome Any suggestions for extending this work to edit values?

Open QuintinPope opened this issue 2 years ago • 1 comments

Thank you very much for making this awesome work publicly available.

I'm working on extending ROME to understand and edit the "values" representations that the model knows (as in human values, not the value half of key/value pairs). E.g., is there a low-rank update we can apply that causes the model to think that environmentalists really like the oil industry? Or an update that causes the model to think that "valuing artistic expression" means you really like geese?

Do you have any suggestions for applying ROME to these sorts of abstract, values-related relationships?

Thanks for your time!

Jul 28 '22 01:07 QuintinPope

This is an interesting question!

We haven't thought too carefully about this yet, but here's a naive proposal involving averaging:

1. Since values are more abstract, it might be helpful to start with a collection of statements reflecting each value.
2. Perform a corrupt-then-restore intervention on activations/components, checking whether transplantation of a clean representation can reliably increase expression of the associated value across those statements.
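
The two steps above could be sketched roughly as follows. This is only a toy illustration, not ROME's code or API: the tiny random network in `forward`, the `READOUT` vector standing in for a "value expression" score, and the random `STATEMENTS` embeddings are all hypothetical placeholders. It corrupts the embeddings at chosen token positions with Gaussian noise, restores one clean hidden state at a time, and averages the recovered score per layer across statements.

```python
import math
import random

DIM, N_LAYERS = 4, 3
_rng = random.Random(0)

def rand_vec(rng, scale=1.0):
    return [rng.gauss(0, scale) for _ in range(DIM)]

# Hypothetical toy "model": fixed random weights, not a real language model.
WEIGHTS = [[rand_vec(_rng, 0.5) for _ in range(DIM)] for _ in range(N_LAYERS)]
READOUT = rand_vec(_rng)  # assumed stand-in for a "value expression" score

def forward(embeddings, patch=None):
    """Run the toy model. `patch` maps (layer, position) -> hidden vector to restore."""
    patch = patch or {}
    hidden = [list(e) for e in embeddings]
    trace = []
    for layer, w in enumerate(WEIGHTS):
        new_hidden = []
        for pos in range(len(hidden)):
            # Causal mixing: each position sees the mean of itself and earlier positions.
            ctx = [sum(h[d] for h in hidden[:pos + 1]) / (pos + 1) for d in range(DIM)]
            new_hidden.append([math.tanh(sum(a * x for a, x in zip(row, ctx))) for row in w])
        # Overwrite any hidden states that should be restored at this layer.
        for (l, p), vec in patch.items():
            if l == layer:
                new_hidden[p] = list(vec)
        hidden = new_hidden
        trace.append(hidden)
    # The score reads only the final layer's state at the last position.
    score = sum(r * x for r, x in zip(READOUT, hidden[-1]))
    return score, trace

def corrupt(embeddings, positions, noise_rng, scale=1.0):
    """Add Gaussian noise to the embeddings at the given token positions."""
    return [[x + noise_rng.gauss(0, scale) for x in e] if i in positions else list(e)
            for i, e in enumerate(embeddings)]

def trace_statement(embeddings, positions, noise_rng):
    """Corrupt, then restore one clean (layer, position) state at a time."""
    clean_score, clean_trace = forward(embeddings)
    noisy = corrupt(embeddings, positions, noise_rng)
    corrupt_score, _ = forward(noisy)
    effects = {}
    for layer in range(N_LAYERS):
        for pos in range(len(embeddings)):
            restored, _ = forward(noisy, {(layer, pos): clean_trace[layer][pos]})
            effects[(layer, pos)] = restored - corrupt_score  # score recovered by the restore
    return clean_score, corrupt_score, effects

def average_by_layer(statements, noise_rng):
    """Average the recovered score at the last corrupted position, per layer."""
    totals = [0.0] * N_LAYERS
    for emb, positions in statements:
        _, _, effects = trace_statement(emb, positions, noise_rng)
        last = max(positions)
        for layer in range(N_LAYERS):
            totals[layer] += effects[(layer, last)]
    return [t / len(statements) for t in totals]

# Hypothetical "collection of statements": random embeddings of lengths 4, 5, 6,
# with the value-bearing tokens assumed to sit at positions 1 and 2.
data_rng = random.Random(1)
STATEMENTS = [([rand_vec(data_rng) for _ in range(n)], {1, 2}) for n in (4, 5, 6)]
layer_effects = average_by_layer(STATEMENTS, random.Random(2))
print(layer_effects)
```

With a real transformer, `forward` would presumably be replaced by forward hooks that read and overwrite hidden states, and the score by the probability of a value-consistent completion; both substitutions are assumptions rather than something specified in this thread.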

Using ROME to update types of beliefs other than simple $(s, r, o)$ tuples is certainly a promising direction. Keep us posted on your progress!

Aug 11 '22 08:08 kmeng01
