representation-engineering
Dataset in the example honesty notebook
I noticed that in the paper and in the example Jupyter code, the ASSISTANT output (the response statement) is truncated. I would like to know the reason. Thank you so much!
To extract functions (such as being honest), we're collecting neural activity at every token position in the response, as described in step 2 of the LAT scan in the paper.
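For anyone trying to reproduce this step, here is a minimal sketch of collecting hidden states at every response-token position with Hugging Face transformers. The model name, layer index, and the way the response span is located are placeholders, not the repo's exact code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works for the sketch; "gpt2" is just a small placeholder.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Pretend you're an honest person making statements about the world. "
response = "The Earth orbits the Sun."
inputs = tokenizer(prompt + response, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq_len, hidden_dim]
n_prompt = len(tokenizer(prompt)["input_ids"])
layer = -5  # an arbitrary intermediate layer, chosen only for illustration

# Keep the activations at every token position of the response only.
response_acts = out.hidden_states[layer][0, n_prompt:, :]
print(response_acts.shape)  # (num_response_tokens, hidden_dim)
```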
Thank you for your reply!
I noticed in the data processing (https://github.com/andyzoujm/representation-engineering/blob/f869e2c587341594343612f47048e1b86084fe93/examples/honesty/utils.py#L41) that the input tokens are truncated. I want to know why not all of the tokens are fed into the model. Thank you!
It was a design choice, since unfinished sentences don't give a strong indication of honesty/dishonesty on their own. But it might not matter that much.
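Roughly, the truncation amounts to keeping successively longer token prefixes of each statement so that it is left unfinished. A sketch of that idea (the tokenizer and cut-off range are assumptions, not the repo's exact values):

```python
from transformers import AutoTokenizer

# Placeholder tokenizer; the repo uses the tokenizer of whichever model it scans.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

statement = "The Earth orbits the Sun once every 365.25 days."
tokens = tokenizer.tokenize(statement)

# Keep progressively longer prefixes of the statement, each one unfinished.
# The starting index is an arbitrary choice for illustration.
partial_statements = [
    tokenizer.convert_tokens_to_string(tokens[:k])
    for k in range(3, len(tokens))
]
print(partial_statements[:3])
```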
What is the difference between [true, false] and [false, true] in the labels for the honesty dataset?
It corresponds to the pairs in the train set. In the train set we randomly shuffle each pair, so for some of them index [0] is the honest statement and for others index [1] is the honest statement.
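In other words, each training pair and its label are shuffled together, so the honest statement lands at index 0 for some pairs and at index 1 for others. A toy illustration (variable names are made up, not the repo's):

```python
import random

random.seed(0)

# One hypothetical pair: [honest variant, untruthful variant] plus a matching label.
pair = ["honest completion", "untruthful completion"]
label = [True, False]  # True marks whichever position holds the honest statement

# Shuffle text and label together so the honest entry is sometimes at index 1.
combined = list(zip(pair, label))
random.shuffle(combined)
pair, label = [list(t) for t in zip(*combined)]

print(pair, label)  # e.g. ['untruthful completion', 'honest completion'] [False, True]
```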
- Why are only true_statements considered for both training and testing? Why weren't false_statements used to generate untruthful_statements?
- Why are the train labels randomly shuffled? Shouldn't they be [1] for honest_statements and [0] for untruthful_statements?
@andyzoujm I'm trying to understand the dataset in honesty_function_dataset(). Only the true_statements are being used to create the train_set, and I'm not sure why. I believe @Dakingrai also asked a similar question, but it was never answered.
@shamikbosefj I am not a contributor to this repo, but my guess is that since the true_statements are truncated under the functional stimulation paradigm and prefixed with "imagine you are a truthful..." or "imagine you are an untruthful...", it doesn't really matter. In the end the statement is not completed, so it leaves the door open to whatever completion, depending on whether the LLM is asked to be truthful or untruthful. This is probably better than creating each pair out of a true and a false statement, because it reduces the amount of variability, so the activations are likely to vary only along the honesty/dishonesty axis. They could of course have also used false statements to create separate pairs (again, truncating the statement and prefixing as described), but they don't need that many pairs, so it probably just wasn't necessary.
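To make that concrete, here is a rough sketch of how such a pair could be built from a single true statement: the same truncated statement under a truthful and an untruthful persona prompt. The template wording is paraphrased, not copied from the repo:

```python
# Hypothetical persona templates, paraphrasing the idea rather than the repo's exact prompts.
truthful_template = (
    "Pretend you're a truthful person making statements about the world.\n{statement}"
)
untruthful_template = (
    "Pretend you're an untruthful person making statements about the world.\n{statement}"
)

# A single (already truncated, i.e. unfinished) true statement.
partial_statement = "The Earth orbits the"

# The contrastive pair differs only in the truthful/untruthful instruction,
# so the activations should differ mainly along the honesty axis.
pair = [
    truthful_template.format(statement=partial_statement),
    untruthful_template.format(statement=partial_statement),
]
print(pair[0])
print(pair[1])
```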
As to your other question, I believe they need to shuffle in order to create variability on the axis of interest (honesty/dishonesty). Otherwise all the difference vectors point roughly the same way along that axis, and the mean-centering in PCA strips out that constant component, so there's little variability left over the pairs in the direction of interest.
See also: https://github.com/andyzoujm/representation-engineering/issues/23#issuecomment-2133801734
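To see why the shuffling matters for PCA, here is a small synthetic sketch. The data is entirely made up; it just mimics pair-difference vectors with one dominant "honesty" component:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "pair difference" vectors: a strong component along axis 0
# (standing in for the honesty direction) plus small noise elsewhere.
n, d = 200, 16
diffs = np.zeros((n, d))
diffs[:, 0] = 3.0
diffs += 0.1 * rng.standard_normal((n, d))

def top_pc(x):
    # PCA via SVD on mean-centered data; return the first principal axis.
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return vt[0]

# Unshuffled: every difference points the same way, so mean-centering removes
# the honesty component and the top PC is dominated by noise.
print(abs(top_pc(diffs)[0]))          # small

# Shuffled pair order: roughly half the differences flip sign, the mean is
# near zero, and the honesty axis now carries the most variance.
signs = rng.choice([-1.0, 1.0], size=(n, 1))
print(abs(top_pc(signs * diffs)[0]))  # close to 1.0
```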
Thanks for the explanation, @joshlevy89. In the lines following the shuffled test data, there's a strange operation:
```python
reshaped_data = np.array([[honest, untruthful] for honest, untruthful in
                          zip(honest_statements[:-1], untruthful_statements[1:])]).flatten()
test_data = reshaped_data[ntrain:ntrain*2].tolist()
```
Why does it skip the first and last of the honest and untruthful sets respectively? Is this just a way to ensure that the same text isn't picked as honest and untruthful?
@shamikbosefj hm, I'm not sure about that line either; I'm not sure what it's trying to accomplish. I think it could probably be replaced by something simpler...
```python
test_data = np.array(combined_data[ntrain:ntrain+ntrain//2]).flatten().tolist()
```
@joshlevy89 I'm wondering if it's due to the random.shuffle() call earlier. That edits the array in place, so maybe they wanted to use the original values again?
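For what it's worth, running that line on tiny placeholder lists shows what the offset pairing actually produces (the statement lists here are invented):

```python
import numpy as np

# Invented placeholder lists standing in for the real statement lists.
honest_statements = ["H0", "H1", "H2", "H3"]
untruthful_statements = ["U0", "U1", "U2", "U3"]

reshaped_data = np.array([[honest, untruthful] for honest, untruthful in
                          zip(honest_statements[:-1], untruthful_statements[1:])]).flatten()

# Each honest statement i gets paired with the untruthful statement i + 1,
# so the two halves of a pair come from different underlying statements.
print(reshaped_data.tolist())  # ['H0', 'U1', 'H1', 'U2', 'H2', 'U3']
```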
@justinphan3110 @joshlevy89 @andyzoujm I think there's a bug in the dataset creation in honesty_function_dataset() in utils.py. The size of train_data is 1024, but the size of train_labels is 512 (see attached screenshot). This is different from test_data, where both the data and the labels have the same size (512). If this is not a bug, can someone please explain the discrepancy?
Updated utils.py
Result