
Enhancing RepControl by introducing the pca_model's `explained_variance_ratio_`

Open semicircle opened this issue 1 year ago • 2 comments

Currently, after training the rep_reader, the coeff value used in the control pipeline has to be tuned by hand for each model, and it varies a lot. Taking primary_emotions as an example, here are the values I found:

# LLaMA-2-Chat-13B coeff=3.0-3.5
# mistralai/Mistral-7B-Instruct-v0.1 coeff=0.5
# HuggingFaceH4/zephyr-7b-beta coeff=0.3
# openchat/openchat_3.5 coeff=0.2

This makes it challenging for RepControl to adapt to new models.

My finding is that folding the pca_model's explained_variance_ratio_ into the control process makes the manipulation more "gentle" / "accurate".

Here are the key modifications. In rep_readers.py:

def get_rep_directions(self, model, tokenizer, hidden_states, hidden_layers, **kwargs):
    """Get PCA components for each layer"""
    directions = {}

    # like directions, save the variance ratio for each layer
    variance_ratio = {}

    for layer in hidden_layers:

        ...  # existing PCA fitting code unchanged

        self.n_components = pca_model.n_components_
        variance_ratio[layer] = pca_model.explained_variance_ratio_

    self.variance_ratio = variance_ratio
    return directions

Each layer's variance_ratio measures how much of that layer's variance each PCA component explains, which can be read as a 'confidence' score for that layer's direction in the control step.
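For reference, explained_variance_ratio_ is each component's share of the total variance. Here is a minimal NumPy sketch of what it computes, on illustrative data rather than the repo's hidden states:

```python
import numpy as np

# Sketch (not the repo's code): reproduce what sklearn's
# explained_variance_ratio_ measures -- each principal component's
# share of total variance, via SVD of the centered data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
X[:, 0] *= 10.0  # inflate variance along one axis

Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
ratios = s**2 / np.sum(s**2)  # matches explained_variance_ratio_ when keeping all components
```

With one dominant axis, ratios[0] is close to 1 and the rest are small, which is exactly the "confidence" signal used above.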

So, when manipulating the output, the activation variable is calculated as:

coeff = 0.2
coeff_with_variance = 2.0

activations = {}
activations_with_variance = {}

for layer in layer_id:
    # original: direction scaled by a hand-tuned, model-specific coeff
    activations[layer] = torch.tensor(
        coeff * rep_reader.directions[layer] * rep_reader.direction_signs[layer]
    ).to(model.device).half()

    # proposed: additionally scale by the top component's explained variance ratio
    variance_ratio = rep_reader.variance_ratio[layer][0]
    activations_with_variance[layer] = torch.tensor(
        coeff_with_variance * rep_reader.directions[layer]
        * rep_reader.direction_signs[layer] * variance_ratio
    ).to(model.device).half()

Applying this method seems to let all the 7B models I've tested share a common coeff value, around 2.0.
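For intuition on why a shared coeff of ~2.0 can work: if the top component's variance ratio lands somewhere around 0.1–0.25 (hypothetical values here, not measurements), the effective per-layer coefficient recovers the same range as the hand-tuned values above:

```python
# Hypothetical top-component variance ratios; the resulting effective
# coefficients fall in the same range as the hand-tuned per-model coeffs.
coeff_with_variance = 2.0
for vr in (0.10, 0.15, 0.25):
    effective = coeff_with_variance * vr
    print(f"variance_ratio={vr} -> effective coeff={effective}")
```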

Theoretically, I came up with this idea when I saw that WrappedBlock uses the controller (activations) to manipulate the hidden-state tensor in a simple linear way, so I folded variance_ratio in in the simplest way possible. Extracting the PCA model's underlying singular values/vectors might give even better control.
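The linear manipulation referred to above can be sketched like this. It is a deliberate simplification with assumed shapes, not the repo's actual WrappedBlock (which also handles masking and operator options):

```python
import torch

def linear_control(hidden: torch.Tensor, controller: torch.Tensor) -> torch.Tensor:
    # Simplified sketch of WrappedBlock-style control: the (already
    # coeff-scaled) controller vector is added to every token's hidden state.
    return hidden + controller

hidden = torch.zeros(1, 4, 8)      # (batch, seq_len, hidden_dim)
controller = 0.5 * torch.ones(8)   # a scaled direction, as built above
out = linear_control(hidden, controller)
```

Since the controller is added as-is, any scalar folded into it (coeff, variance_ratio) directly scales the strength of the intervention.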

Thanks for sharing this great work!

semicircle avatar Nov 21 '23 11:11 semicircle