representation-engineering
Representation Engineering: A Top-Down Approach to AI Transparency
Hi! For the control phase, may I ask how you select the layers to be controlled? Thanks very much!
Hi. In your honesty score calculation, what is the justification for `results[pos][0][layer][0] * honesty_rep_reader.direction_signs[layer][0]`? Why do you need to multiply by the direction sign rather than just using `results[pos][0][layer][0]`? Thanks.
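For context on the sign question, here is a minimal NumPy sketch of why a PCA-style direction usually needs a separate sign. It is not the repository's code: `first_pc`, `choose_sign`, and the synthetic data are hypothetical, and it assumes the reading direction comes from an unsupervised decomposition whose sign is arbitrary (both `v` and `-v` are equally valid principal components). Under that assumption, a sign chosen from labeled examples makes a higher score consistently mean "more honest".

```python
import numpy as np

def first_pc(x):
    """Return the first principal direction of x (sign is arbitrary)."""
    x = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return vt[0]

def choose_sign(direction, pos, neg):
    """Hypothetical helper: +1 if positive examples project higher on average, else -1."""
    return 1.0 if (pos @ direction).mean() >= (neg @ direction).mean() else -1.0

# Synthetic "hidden states": two clusters separated along one axis.
rng = np.random.default_rng(0)
concept = np.array([1.0, 0.0, 0.0, 0.0])
honest = rng.normal(size=(50, 4)) * 0.1 + 2.0 * concept
dishonest = rng.normal(size=(50, 4)) * 0.1 - 2.0 * concept

direction = first_pc(np.vstack([honest, dishonest]))
flipped = -direction  # an equally valid principal direction

sign_a = choose_sign(direction, honest, dishonest)
sign_b = choose_sign(flipped, honest, dishonest)

# The raw projection changes sign with the direction; the signed score does not.
score_a = sign_a * (honest @ direction).mean()
score_b = sign_b * (honest @ flipped).mean()
```

Without the sign factor, the same honest inputs would score positive under one direction and negative under its flipped twin, so scores would not be comparable across layers.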
What does the `self` argument in the functions of your customized pipeline refer to? Thanks.
I found that the reading pipeline's computation is very slow. I moved the projection and recentering to the GPU, which made it run 20x faster. May I open a pull...
We get this error when doing rep-reading on google/gemma-2-2b-it: ``` ValueError: Input X contains NaN. PCA does not accept missing values encoded as NaN natively. For supervised learning, you might...
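A small diagnostic sketch for this kind of error, assuming the hidden states have already been collected into a dict that maps each layer index to an array of shape `(n_samples, hidden_dim)`. The helper name `find_nonfinite_layers` and the toy data are hypothetical and not part of the repository. It only reports which layers contain NaN or infinite values before they reach PCA; it does not identify the cause.

```python
import numpy as np

def find_nonfinite_layers(hidden_states):
    """Map each affected layer to the number of samples with NaN/inf values."""
    bad = {}
    for layer, h in hidden_states.items():
        mask = ~np.isfinite(h)          # True where a value is NaN or +/-inf
        if mask.any():
            bad[layer] = int(mask.any(axis=1).sum())
    return bad

# Toy data standing in for collected activations.
states = {
    -1: np.array([[0.1, 0.2], [0.3, 0.4]], dtype=np.float32),
    -2: np.array([[np.nan, 1.0], [2.0, np.inf], [0.5, 0.5]], dtype=np.float32),
}
bad = find_nonfinite_layers(states)
```

If only some layers are affected, restricting the reading to the clean layers is one way to narrow the problem down.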
Thanks for sharing this exciting repo and I appreciate it a lot. I want to ask whether you have implemented the control methods introduced in the paper, e.g., contrast vector,...
Which parameters in `llama_lorra_tqa_7b.sh` reproduce the paper's result of **55.0** (on the TQA dataset)? `{'tqa_accuracy': 0.42717258261933905, 'arc-e_accuracy': 0.6929824561403509}` (Screenshot from 2024-07-23 17-47-48)
The loss function in the training flow should minimize the difference between the positive and negative hidden states. We don't need the original activations, right? **So there is no...
I am curious how you evaluate the in-domain generalization of the honesty probe. I found this in the paper `With this setup, the resulting LAT reading vector reaches a classification...
The `repe_kwargs` passed to `generate` get lost. I assume this is due to something that happens inside the HuggingFace `generate` method. System info: I'm using the latest transformers