Any suggestions for extending this work to edit values?
Thank you very much for making this awesome work publicly available.
I'm working on extending ROME to understand and edit the "values" representations that the model holds (as in human values, not the values part of key/value pairs). For example, is there a low-rank update we can apply that causes the model to think that environmentalists really like the oil industry? Or an update that causes the model to think that "valuing artistic expression" means you really like geese?
Do you have any suggestions for applying ROME to these sorts of abstract, values-related relationships?
Thanks for your time!
This is an interesting question!
We haven't thought too carefully about this yet, but here's a naive proposal involving averaging:
- Since values are more abstract, it might be helpful to start with a collection of statements reflecting each value.
- Perform a corrupt-then-restore intervention on activations/components, checking whether transplantation of a clean representation can reliably increase expression of the associated value across those statements.
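
To make the corrupt-then-restore step concrete, here is a rough toy sketch. It is not the tracing code from this repo: a small random residual network in NumPy stands in for the language model, each input vector stands in for one statement's embedding, and `readout` stands in for whatever score measures expression of the value. All names (`run_model`, `restoration_effects`, `readout`, `noise_scale`) are hypothetical. For each statement it records the clean per-layer outputs, corrupts the input with Gaussian noise, transplants one clean layer output at a time into the corrupted run, and averages the recovered score across statements.

```python
import numpy as np


def run_model(weights, x, patch=None):
    """Run a toy residual network (assumed stand-in for a real model).

    patch is None or (layer_index, activation); when given, that layer's
    branch output is overwritten with the supplied activation.
    Returns the final hidden state and the list of branch outputs.
    """
    h = x
    branch_outputs = []
    for i, w in enumerate(weights):
        out = np.tanh(w @ h)
        if patch is not None and patch[0] == i:
            out = patch[1]  # transplant the clean representation
        branch_outputs.append(out)
        h = h + out  # residual connection
    return h, branch_outputs


def restoration_effects(weights, readout, statements, noise_scale, rng):
    """Per layer: mean score recovered by restoring that layer's clean output."""
    effects = np.zeros(len(weights))
    for x in statements:
        # Clean run: record every layer's branch output.
        _, clean_outputs = run_model(weights, x)
        # Corrupted run: add Gaussian noise to the input.
        corrupted_x = x + noise_scale * rng.standard_normal(x.shape)
        corrupted_h, _ = run_model(weights, corrupted_x)
        corrupted_score = readout @ corrupted_h
        # Restored runs: corrupted input, one clean layer output patched in.
        for i in range(len(weights)):
            restored_h, _ = run_model(
                weights, corrupted_x, patch=(i, clean_outputs[i])
            )
            effects[i] += readout @ restored_h - corrupted_score
    return effects / len(statements)


# Hypothetical demo with random weights and random "statements".
rng = np.random.default_rng(0)
dim, n_layers = 8, 4
weights = [0.5 * rng.standard_normal((dim, dim)) for _ in range(n_layers)]
readout = rng.standard_normal(dim)
statements = [rng.standard_normal(dim) for _ in range(16)]
effects = restoration_effects(weights, readout, statements, noise_scale=1.0, rng=rng)
print(effects)
```

In a real experiment the patching would presumably be done with hooks on the model's hidden states, and layers whose restoration reliably recovers the value-related score across many statements would be the natural candidates for an edit.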
Using ROME to update types of beliefs other than simple $(s, r, o)$ tuples is certainly a promising direction. Keep us posted on your progress!