Enhancing RepControl by introducing the pca_model's `explained_variance_ratio_`
Currently, after training the `rep_reader`, the `coeff` variable used in the control pipeline has to be tuned purely by experiment, and its value varies a lot between models. Taking `primary_emotions` as an example, here are the values I found:
```python
# LLaMA-2-Chat-13B                     coeff = 3.0-3.5
# mistralai/Mistral-7B-Instruct-v0.1   coeff = 0.5
# HuggingFaceH4/zephyr-7b-beta         coeff = 0.3
# openchat/openchat_3.5                coeff = 0.2
```
This makes it challenging for RepControl to adapt to new models.
My finding is that introducing the pca_model's `explained_variance_ratio_` into the control process makes the manipulation more "gentle" / "accurate".
Here are the key modifications. In `rep_readers.py`:
```python
def get_rep_directions(self, model, tokenizer, hidden_states, hidden_layers, **kwargs):
    """Get PCA components for each layer"""
    directions = {}
    # like directions, save the explained variance ratio for each layer
    variance_ratio = {}
    for layer in hidden_layers:
        # ... (existing PCA fitting code unchanged) ...
        self.n_components = pca_model.n_components_
        variance_ratio[layer] = pca_model.explained_variance_ratio_
    self.variance_ratio = variance_ratio
    return directions
```
Each layer's `variance_ratio` represents how much of the variance in the hidden states is explained by that layer's PCA direction, which can be interpreted as a 'confidence' score for that layer in the control step.
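To illustrate what this quantity looks like, here is a minimal sketch (not from the repo; the shapes and random data are made up) of reading `explained_variance_ratio_` off a fitted sklearn PCA model:

```python
# Minimal sketch, assuming the rep_reader is fit on per-layer hidden-state
# difference vectors; the data and shapes below are hypothetical.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
hidden_diffs = rng.normal(size=(128, 4096))  # (n_difference_vectors, hidden_size)

pca_model = PCA(n_components=1).fit(hidden_diffs)
confidence = pca_model.explained_variance_ratio_[0]  # fraction of variance in component 0
print(f"layer confidence: {confidence:.4f}")
```

A layer whose first component explains more of the variance ends up with a larger scaling factor, so its control vector is applied more strongly.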
So, when manipulating the output, the `activations` variable is calculated as:
```python
coeff = 0.2
coeff_with_variance = 2.0
activations = {}
activations_with_variance = {}
for layer in layer_id:
    # original control vector: coeff * sign * PCA direction
    activations[layer] = torch.tensor(
        coeff * rep_reader.directions[layer] * rep_reader.direction_signs[layer]
    ).to(model.device).half()

    # scale by the first component's explained variance ratio for this layer
    variance_ratio = rep_reader.variance_ratio[layer][0]
    # print(variance_ratio)
    activations_with_variance[layer] = torch.tensor(
        coeff_with_variance * rep_reader.directions[layer] * rep_reader.direction_signs[layer] * variance_ratio
    ).to(model.device).half()
```
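For readability, the same scaling could be wrapped in a small helper; `build_control_activations` is just an illustrative name, and it assumes `rep_reader` exposes `directions`, `direction_signs`, and the new `variance_ratio` dict shown above:

```python
import torch

def build_control_activations(rep_reader, layer_ids, coeff, device, use_variance=True):
    """Illustrative helper (not repo code): build per-layer control tensors,
    optionally scaled by each layer's first-component explained variance ratio."""
    activations = {}
    for layer in layer_ids:
        direction = rep_reader.directions[layer] * rep_reader.direction_signs[layer]
        scale = coeff
        if use_variance:
            # Down-weight layers whose PCA direction explains little of the variance.
            scale = scale * rep_reader.variance_ratio[layer][0]
        activations[layer] = torch.tensor(scale * direction).to(device).half()
    return activations

# e.g. activations_with_variance = build_control_activations(
#     rep_reader, layer_id, coeff_with_variance, model.device)
```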
Applying this method seems to allow all the 7B models I've tested to share a common `coeff` value of approximately 2.0.
I came up with this idea when I saw that `WrappedBlock` uses the controller (the activations) to manipulate the hidden-state tensor in a simple linear way, so I factored the `variance_ratio` in in the simplest way possible. Perhaps extracting the PCA model's underlying singular vectors could give even better control over this.
Thanks for sharing this great work!