
Dataset in example honest notebook

Open · Zijian007 opened this issue 2 years ago · 13 comments

I notice that in the paper and the example Jupyter code, the output of the ASSISTANT (the response), i.e. the statement, is truncated. I would like to know the reason. Thank you so much!

Zijian007 · Nov 07 '23

To extract functions (such as being honest), we're collecting neural activity at every token position in the response, as described in step 2 of the LAT scan in the paper.
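
For anyone reproducing this, here is a minimal sketch of collecting hidden states at every response-token position, assuming a HuggingFace causal LM; "gpt2" is only a stand-in and none of this is the repo's exact code:

    # Sketch: read hidden states at each token position of the response.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "USER: Pretend you're an honest person. ASSISTANT:"
    response = " The Eiffel Tower is in Paris."

    inputs = tok(prompt + response, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # hidden_states is a tuple of (n_layers + 1) tensors, each [1, seq_len, hidden_dim]
    n_prompt_tokens = len(tok(prompt)["input_ids"])
    layer = -5  # some middle-to-late layer
    acts = out.hidden_states[layer][0, n_prompt_tokens:, :]  # one vector per response token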

andyzoujm · Nov 07 '23

Thank you for your reply!

I notice in the data-processing code https://github.com/andyzoujm/representation-engineering/blob/f869e2c587341594343612f47048e1b86084fe93/examples/honesty/utils.py#L41 that the input tokens are truncated. Why not feed all the tokens into the model? Thank you!

Zijian007 · Nov 08 '23

It was a design choice, since unfinished sentences don't give a strong indication of honesty/dishonesty. But it might not matter that much.
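
For concreteness, a sketch of the truncation-plus-prefixing idea; the helper name, template wording, and tags below are placeholders rather than the repo's exact strings:

    from transformers import AutoTokenizer

    def make_pairs(statement, tokenizer, user_tag="USER:", assistant_tag="ASSISTANT:"):
        # Placeholder template; the repo's actual wording differs.
        template = "Pretend you're {persona} person making statements about the world."
        tokens = tokenizer.tokenize(statement)
        honest, untruthful = [], []
        for idx in range(1, len(tokens)):  # one sub-statement per prefix length
            partial = tokenizer.convert_tokens_to_string(tokens[:idx])
            honest.append(f"{user_tag} {template.format(persona='an honest')} {assistant_tag} {partial}")
            untruthful.append(f"{user_tag} {template.format(persona='an untruthful')} {assistant_tag} {partial}")
        return honest, untruthful

    tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer
    h, u = make_pairs("The Eiffel Tower is in Paris.", tok)

Note that each honest/untruthful pair wraps the *same* truncated text; only the persona prefix differs.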

andyzoujm · Nov 08 '23

What is the difference between [True, False] and [False, True] in the labels for the honesty dataset?

Jeffwang87 · Nov 17 '23

They correspond to the pairs in the train set: within each pair we randomly shuffle the order, so in some pairs index [0] is the honest statement and in others index [1] is.
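
In toy form (placeholder data, not the repo's exact code), the per-pair shuffling looks like this:

    import random

    random.seed(0)
    pairs = [("honest_0", "untruthful_0"), ("honest_1", "untruthful_1")]  # placeholder pairs
    train_data, train_labels = [], []
    for honest, untruthful in pairs:
        pair = [honest, untruthful]
        random.shuffle(pair)                               # shuffles the pair in place
        train_data.append(pair)
        train_labels.append([s == honest for s in pair])   # [True, False] or [False, True]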

justinphan3110cais · Nov 17 '23

  1. Why are only true_statements used for both training and testing? Why were false_statements not used to generate the untruthful_statements?
  2. Why are the train labels randomly shuffled? Shouldn't they be [1] for honest_statements and [0] for untruthful_statements?

Dakingrai · Feb 25 '24

@andyzoujm, I'm trying to understand the dataset in honesty_function_dataset(). Only the true_statements are used to create the train_set, and I'm not sure why. I believe @Dakingrai asked a similar question above, but it was never answered.

shamikbosefj · May 28 '24

@shamikbosefj I am not a contributor to this repo, but my guess is that since the true_statements are truncated under the functional stimulation paradigm and prefixed with "imagine you are a truthful..." or "imagine you are an untruthful...", it doesn't really matter. In the end the statement is left unfinished, so it leaves the door open to whatever completion fits, depending on whether the LLM is asked to be truthful or untruthful. This is probably better than creating each pair from one true and one false statement, because it reduces the amount of variability, so the activations are likely to vary only along the honesty/dishonesty axis. I'm sure they could also have used false statements to create separate pairs (again, truncating the statement and prefixing as described), but they don't need that many pairs, so it probably just wasn't necessary.

As to your other question: they need to shuffle in order to create variability along the axis of interest (honesty/dishonesty), I believe. Otherwise, when PCA is done on the differences, there is no variability across the pairs in the direction of the vector on that axis.
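
A toy numpy illustration of that point (my own example, not from the repo): PCA mean-centers the data first, so if every pair difference points the same way, centering removes exactly the direction of interest, while random sign flips restore variance along it:

    import numpy as np

    rng = np.random.default_rng(0)
    v = np.array([1.0, 0.0])                            # the "honesty direction"
    diffs = v + 0.1 * rng.standard_normal((100, 2))     # unshuffled: every diff ~ +v
    signs = rng.choice([-1.0, 1.0], size=(100, 1))
    shuffled = signs * diffs                            # shuffled: randomly +v or -v

    def top_pc(X):
        Xc = X - X.mean(axis=0)                         # PCA mean-centers first
        return np.linalg.svd(Xc, full_matrices=False)[2][0]

    print(top_pc(diffs))     # dominated by noise: the +v component was removed by centering
    print(top_pc(shuffled))  # ~[±1, 0]: recovers the honesty direction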

See also: https://github.com/andyzoujm/representation-engineering/issues/23#issuecomment-2133801734

joshlevy89 · May 28 '24

Thanks for the explanation, @joshlevy89. In the lines following the shuffled test data, there's a strange operation:

reshaped_data = np.array([[honest, untruthful] for honest, untruthful in 
                          zip(honest_statements[:-1], untruthful_statements[1:])]).flatten()
test_data = reshaped_data[ntrain:ntrain*2].tolist()

Why does it skip the last of the honest set and the first of the untruthful set? Is this just a way to ensure that the same text isn't picked as both honest and untruthful?
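
On toy lists, that zip produces the following:

    honest = ["h0", "h1", "h2", "h3"]
    untruthful = ["u0", "u1", "u2", "u3"]
    print(list(zip(honest[:-1], untruthful[1:])))
    # [('h0', 'u1'), ('h1', 'u2'), ('h2', 'u3')]
    # i.e. the last honest and the first untruthful statement are dropped,
    # and each honest statement is paired with the *next* untruthful one.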

shamikbosefj · May 29 '24

@shamikbosefj Hm, I'm not sure about that line either, or what it's trying to accomplish. I think it could probably be replaced by something simpler: test_data = np.array(combined_data[ntrain:ntrain + ntrain//2]).flatten().tolist()

joshlevy89 · May 29 '24

@joshlevy89 I'm wondering if it's due to the random.shuffle() call earlier. That edits the array in place, so maybe they wanted to use the original values again?

shamikbosefj · May 29 '24

@justinphan3110 @joshlevy89 @andyzoujm I think there's a bug in the dataset creation in honesty_function_dataset() in utils.py: the size of train_data is 1024, but the size of train_labels is 512 (see attached screenshots). This differs from the test_data, where both data and labels have the same size (512). If this is not a bug, can someone please explain the discrepancy?

[Screenshot: updated utils.py]

[Screenshot: result]

shamikbosefj · Jun 13 '24
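
One possible explanation for the mismatch, sketched as a guess rather than a confirmed answer: the rep-reading pipeline may consume train_data one statement at a time but train_labels one pair at a time, in which case the 2:1 ratio would be intentional. In shape terms (all names below are placeholders):

    # Hypothetical shapes, assuming labels are per *pair* while train_data
    # is flattened to one entry per statement (unconfirmed guess):
    pairs = [("honest_0", "untruthful_0"), ("untruthful_1", "honest_1")]  # 2 shuffled pairs
    train_data = [s for pair in pairs for s in pair]  # 4 strings (2 per pair)
    train_labels = [[True, False], [False, True]]     # 2 entries, one per pair
    assert len(train_data) == 2 * len(train_labels)   # matches the 1024-vs-512 report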
