
Representation Engineering: A Top-Down Approach to AI Transparency

Results 20 representation-engineering issues

Hi! For the control phase, may I ask how you select the layers to be controlled? Thanks very much!

Hi. In your honesty score calculation, what is the justification for `results[pos][0][layer][0] * honesty_rep_reader.direction_signs[layer][0]`? Why do you need to multiply by the direction sign rather than just using `results[pos][0][layer][0]`? Thanks.
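A small sketch may help frame this question. A direction found by PCA is only defined up to sign, so a raw projection can come out negative for the concept of interest. The pure-Python example below assumes the sign is chosen so that labelled positive examples project higher than negative ones. The helpers `project` and `direction_sign` and the toy data are hypothetical and are not the repository's implementation.

```python
def project(vec, direction):
    """Dot product of a hidden state with a reading direction."""
    return sum(v * d for v, d in zip(vec, direction))

def direction_sign(pos_states, neg_states, direction):
    """Return +1 if positive examples project higher on average, else -1.

    Assumption: this mirrors the idea behind a stored direction sign;
    the repository may compute it differently.
    """
    pos_mean = sum(project(h, direction) for h in pos_states) / len(pos_states)
    neg_mean = sum(project(h, direction) for h in neg_states) / len(neg_states)
    return 1 if pos_mean >= neg_mean else -1

# Toy 2-D "hidden states" (made up for illustration).
honest = [[1.0, 0.2], [0.8, -0.1]]
dishonest = [[-0.9, 0.1], [-1.1, 0.0]]

# PCA could return this axis or its negation; both explain the same variance.
direction = [-1.0, 0.0]
sign = direction_sign(honest, dishonest, direction)

raw_scores = [project(h, direction) for h in honest]
signed_scores = [s * sign for s in raw_scores]
```

With the sign applied, a higher score consistently means "more honest" regardless of which orientation the decomposition happened to return.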

What does `self` refer to in the functions of your customized pipeline? Thanks

documentation

I found that the computation in the reading pipeline is very slow. I moved the projection and recentering to the GPU, which made it run 20x faster. May I open a pull...

We get this error when doing rep-reading on google/gemma-2-2b-it:
```
ValueError: Input X contains NaN. PCA does not accept missing values encoded as NaN natively. For supervised learning, you might...
```
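One way to narrow down an error like this is to check the collected hidden states for NaN before they reach PCA. The sketch below is pure Python and assumes the hidden states are available as a list of rows of floats. The helpers `find_nan_rows` and `drop_nan_rows` are hypothetical, and the report does not say where the NaN values come from.

```python
import math

def find_nan_rows(rows):
    """Return the indices of rows that contain at least one NaN."""
    return [i for i, row in enumerate(rows) if any(math.isnan(x) for x in row)]

def drop_nan_rows(rows):
    """Return only the rows that are free of NaN values."""
    bad = set(find_nan_rows(rows))
    return [row for i, row in enumerate(rows) if i not in bad]

# Toy data standing in for per-example hidden states (made up for illustration).
hidden_states = [
    [0.1, 0.2, 0.3],
    [float("nan"), 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

nan_rows = find_nan_rows(hidden_states)
clean_states = drop_nan_rows(hidden_states)
```

Dropping rows only hides the symptom; the indices in `nan_rows` are mainly useful for tracing which inputs or layers produced the invalid activations.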

Thanks for sharing this exciting repo and I appreciate it a lot. I want to ask whether you have implemented the control methods introduced in the paper, e.g., contrast vector,...

Which parameters in `llama_lorra_tqa_7b.sh` reproduce the paper's result of **55.0** on the TQA dataset? `{'tqa_accuracy': 0.42717258261933905, 'arc-e_accuracy': 0.6929824561403509}` ![Screenshot from 2024-07-23 17-47-48](https://github.com/user-attachments/assets/2e058f9c-98da-4850-af62-cf062f9f4288)

The loss function in the training flow should minimize the difference between the positive and negative hidden states. We don't need the original activations, right? **So there is no...

I am curious how you evaluate the in-domain generalization of the honesty probe. I found this in the paper `With this setup, the resulting LAT reading vector reaches a classification...

The `repe_kwargs` passed to `generate` get lost. I assume this is due to something that happens inside the HuggingFace `generate` method. System info: I'm using the latest transformers